Genomics and Transcriptomics of Spirodela polyrhiza

Genomics and Transcriptomics of Spirodela polyrhiza
Doug Bryant
Bioinformatics Core Facility & Todd Mockler Group, Donald Danforth Plant Science Center

Desired Outcomes
  High-quality genomic reference sequence
  Transcriptome definition, functional annotation
  Comparison of several additional accessions

Genomic, RNA-Seq Data
  Spirodela accession 9509 deeply sequenced
  Additional 8 accessions, low coverage
  RNA-seq obtained from 9509 and two other accessions under Control and ABA conditions
  Kuehdorf, Jetschke, Ballani, and Appenroth, 2013

Analysis Strategy
  Genome data acquisition
  Transcriptome data acquisition
  Data quality control
  Genome, transcriptome assembly
  Genome structural annotation
  Transcriptome functional annotation
  Differential expression analysis*

Genome

Data Acquisition: Genomic
  Illumina HiSeq
  Diverse library set
    Overlap
    300-500 bp
    Several mate-pair
  Illumina HiSeq 2000

Quality Control
  Raw data
  Visualize
  Adaptors
  Verify insert sizes
  Retain pairs only
  Trim 3' low quality

Insert Size  Stdev  Read Length  Trimmed Avg Read Length  Read Pairs Passed QC  Coverage @ 329 Mbp
388          43     101          100.79                   28,630,414            17.54
490          168    101          100.72                   33,683,677            20.62
228          31     101          100.44                   41,684,344            25.45
166          17     101          99.54                    91,328,045            55.26
166          17     101          99.62                    100,172,620           60.66
217          17     101          83.45                    44,051,527            22.35
4660         1110   101          100.12                   177,983,251           108.33
4500         -      151          150.74                   6,782,583             6.22
Total                                                     524,316,461           316.43
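The coverage column in the QC table can be reproduced from the other columns. The sketch below assumes that coverage is the total number of trimmed bases (two reads per pair) divided by the 329 Mbp figure in the column header; the function name and the choice of rows are illustrative, not taken from the slides.

```python
def coverage(read_pairs: int, trimmed_read_length: float, genome_size_bp: float) -> float:
    """Estimated sequence coverage: bases in both reads of every pair / genome size."""
    return read_pairs * 2 * trimmed_read_length / genome_size_bp


GENOME_SIZE_BP = 329e6  # denominator stated in the table header

# (insert size, read pairs passed QC, trimmed average read length) for three table rows
libraries = [
    (388, 28_630_414, 100.79),
    (4660, 177_983_251, 100.12),
    (4500, 6_782_583, 150.74),
]

for insert_size, pairs, trimmed_len in libraries:
    cov = coverage(pairs, trimmed_len, GENOME_SIZE_BP)
    print(f"insert {insert_size:>5} bp: {cov:.2f}x")
```

Under this assumption the three rows shown agree with the listed values to two decimal places.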

Genomic Data: Insert Size Distribution, 9509

Insert size  Estimated coverage  Fraction of total data
180 bp       31                  28%
500 bp       25                  22%
2,000 bp     20                  17%
5,000 bp     11                  10%
20,000 bp    26                  23%

Genome Assembly
  Several iterations
  Preliminary assemblies with Velvet, SOAPdenovo
  Final assembly with AllPathsLG
  Polished with SSPACE

Genome Assembly Statistics (accession 9509)
  Assembly size (Mbp; 152 expected)  146 (96% of expected)
  Scaffolds (#)                      774
  Scaffolds >= 1 Mbp (#)             32 (4.13%)
  N50 scaffold length (bp)           4,305,909
  L50 scaffold (#)                   11
  N90 scaffold length (bp)           1,428,181
  L90 scaffold (#)                   31
  Ns (%)                             7.7
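For readers unfamiliar with the N50/L50 and N90/L90 figures above, here is a minimal sketch of how such statistics are commonly computed from a list of scaffold lengths. The function name `nx_lx` and the toy lengths are hypothetical; the slides do not say which tool produced the reported values.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given scaffold lengths.

    Nx is the length of the scaffold at which the running total, taken from
    longest to shortest, first reaches `fraction` of the assembly size.
    Lx is the number of scaffolds needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths, not data from the assembly
toy = [10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy, 0.5)
n90, l90 = nx_lx(toy, 0.9)
print(f"N50={n50} L50={l50} N90={n90} L90={l90}")
```

Read this way, an L50 of 11 means that the 11 longest scaffolds together hold at least half of the assembled bases.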

Genomic Physical Coverage
[Figure: physical coverage (x) by library (180 bp, 500 bp, 2,000 bp, 5,000 bp, 20,000 bp); total: 370x]

Genome Assembly Quality Assessment
  Reads used in assembly?
  Reads align to assembly?
  Core eukaryotic genes present?

Genomic Reads Used, Aligned
[Figure: reads used in the assembly (%) and reads aligning to the assembly (%), by library (180 bp, 500 bp, 2,000 bp, 5,000 bp, 20,000 bp)]

Core Eukaryotic Genes
  Core Eukaryotic Genes Mapping Approach (CEGMA), Korf lab (korflab.ucdavis.edu/)
  Search genome for 248 low copy, highly conserved genes
  Assess completeness of genome

Core Eukaryotic Genes
[Figure: number of core genes identified (complete and partial) per species; bar labels as extracted: 246, 247, 246, 237, 241, 241, 226, 215]
Core genes at least partially identified, by species:
  A. thaliana    99.60%
  B. distachyon  99.19%
  Z. mays        97.18%
  S. polyrhiza   97.18%
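The percentages in the chart are consistent with a simple ratio against the 248 CEGMA core genes. The counts used below (247, 246, 241) are inferred from the stated percentages and are an assumption, since the chart's bar-to-species mapping did not survive extraction.

```python
CORE_GENES = 248  # size of the CEGMA core gene set named on the previous slide


def percent_found(n_found: int, total: int = CORE_GENES) -> float:
    """Share of core genes found (at least partially), as a percentage."""
    return 100.0 * n_found / total


# Inferred counts (assumption): chosen because they match the stated percentages
for species, n_found in [("A. thaliana", 247), ("B. distachyon", 246), ("S. polyrhiza", 241)]:
    print(f"{species}: {percent_found(n_found):.2f}%")
```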

Resequencing

Resequencing Kuehdorf, Jetschke, Ballani, and Appenroth 2013

Resequencing Data
[Figure: sequencing depth of coverage by strain (9509, 9504, 9506, 9316, 9242, 9502, 9511, 9512, 9501)]

Resequencing Variation
[Figure: SNP and INDEL positions per accession]

Accession  Total variant positions (%)
9509       0.12
9504       0.39
9506       0.35
9316       0.37
9242       0.37
9502       0.20
9511       0.21
9512       0.29
9501       0.20

Resequencing Assemblies
  Per accession: ~30x coverage, single library
  Assembled each using Velvet
  Mean assembled size: 128 Mbp (~84%) (stdev: 6 Mbp)
  Mean N50: 15 kb (stdev: 1.5 kb)
  Nearly all contigs (>98%) align to 9509 genome assembly
  Defining structural differences in progress

Transcriptome

RNA-Seq Data Kuehdorf, Jetschke, Ballani, and Appenroth, 2013

RNA-Seq Data
[Figure: number of 101 bp RNA-seq reads (millions) per accession and treatment: 9509 Control, 9509 ABA, 9316 Control, 9316 ABA, 9501 Control, 9501 ABA]

Transcriptome Discovery
  1. Reference-guided assembly
       Tophat2
       Cufflinks2
  2. De novo predictions
       Maker, informed by assembly
       SNAP, Augustus, GeneMarkHMM
       Iteratively trained SNAP

(1) Reference-Guided Transcriptome Assembly
  Align each RNA-seq library (6) to genome
  For each, define transcripts based on alignments
  Merge resulting assemblies to discover gene models, alternative splicing
  Output:
    Gene, transcripts annotation (GFF3)
    Transcripts (FASTA)
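The align, assemble, and merge steps above might be driven as follows. This is only a sketch: it builds command lines without running them, it assumes paired-end reads, and it assumes the merge step uses cuffmerge, which the slides do not name. All file and directory names are hypothetical.

```python
def reference_guided_commands(genome_index, genome_fasta, libraries, threads=4):
    """Build TopHat2/Cufflinks command lines per library, plus one merge command.

    `libraries` maps a library name to a (reads_1, reads_2) pair of FASTQ paths.
    Returns (commands, gtf_paths); gtf_paths is what would be listed, one path
    per line, in the text file handed to the merge step.
    """
    commands = []
    gtf_paths = []
    for name, (reads_1, reads_2) in libraries.items():
        align_dir = f"tophat_{name}"
        assembly_dir = f"cufflinks_{name}"
        # Align one RNA-seq library to the genome
        commands.append(["tophat2", "-p", str(threads), "-o", align_dir,
                         genome_index, reads_1, reads_2])
        # Define transcripts from that library's alignments
        commands.append(["cufflinks", "-p", str(threads), "-o", assembly_dir,
                         f"{align_dir}/accepted_hits.bam"])
        gtf_paths.append(f"{assembly_dir}/transcripts.gtf")
    # Merge the per-library assemblies (assumed tool: cuffmerge)
    commands.append(["cuffmerge", "-p", str(threads), "-s", genome_fasta,
                     "-o", "merged_asm", "assemblies.txt"])
    return commands, gtf_paths


# Hypothetical library names and file paths
libs = {"9509_control": ("9509_control_1.fq", "9509_control_2.fq"),
        "9509_aba": ("9509_aba_1.fq", "9509_aba_2.fq")}
commands, gtfs = reference_guided_commands("sp9509_index", "sp9509.fa", libs)
for cmd in commands:
    print(" ".join(cmd))
```

Building the commands as lists keeps the sketch testable without the tools installed; each list could be passed to `subprocess.run` once `assemblies.txt` has been written from `gtf_paths`.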

(2) De novo Transcriptome Discovery
  Discover genes not expressed in RNA-seq experiments
  Train algorithms on reference-guided assembly
    1. Call high-confidence open reading frames in transcript sequences
    2. Use transcripts and translated proteins to inform and train de novo gene callers
    3. Iteratively train SNAP on resulting output

(2) De novo Transcriptome Discovery
  Assembled: 25,090 loci, 41,884 transcripts
  Of 41,884 transcripts, 39,076 have a complete ORF of at least 33 amino acids
  Initial training using these transcripts and proteins
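The "complete ORF and at least 33 amino acids" criterion could be checked along these lines. This sketch is not the tool used in the talk: it assumes a complete ORF means an ATG followed by an in-frame stop codon, scans only the three forward frames, and counts the start codon but not the stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf_aa(seq: str) -> int:
    """Length in amino acids of the longest ATG-to-stop ORF on the forward strand."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the currently open ATG in this frame, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)  # codons before the stop
                start = None
    return best


def passes_orf_filter(seq: str, min_aa: int = 33) -> bool:
    """True if the transcript has a complete ORF of at least `min_aa` residues."""
    return longest_complete_orf_aa(seq) >= min_aa


# Toy transcript: start codon, 32 alanine codons, stop codon
toy_transcript = "ATG" + "GCT" * 32 + "TAA"
print(passes_orf_filter(toy_transcript))
```

An ORF that runs off the end of the transcript without a stop codon is not counted, which is what distinguishes a complete ORF from a partial one in this sketch.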

Transcriptome Discovery: Results
  Preliminary Maker output: 28,600 genes
  Prune:
    Must have RNA-seq evidence across >= 50%, or >= 100 amino acids with complete ORF
    Prune bacterial scaffolds
  Final gene set: 23,495 genes
  Transcriptome size (nucleic acids): 33 Mbp
  Mean protein length: 358 amino acids
  19,380 (82%) have functional prediction from BLASTP and/or InterProScan
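The pruning rule can be expressed as a small predicate. The record fields below are hypothetical, and ">= 50%" is interpreted here as the fraction of a gene model covered by RNA-seq evidence, which is an assumption; the slide does not say what the percentage is measured against.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    gene_id: str
    rnaseq_fraction: float       # assumed: fraction of the model with RNA-seq support
    protein_length: int          # amino acids
    complete_orf: bool
    on_bacterial_scaffold: bool


def keep(gene: GeneModel) -> bool:
    """Apply the pruning rule: drop bacterial scaffolds, then require evidence."""
    if gene.on_bacterial_scaffold:
        return False
    has_rnaseq = gene.rnaseq_fraction >= 0.5
    has_long_orf = gene.complete_orf and gene.protein_length >= 100
    return has_rnaseq or has_long_orf


# Toy gene models, not data from the annotation
models = [
    GeneModel("g1", 0.80, 60, False, False),   # kept: RNA-seq support
    GeneModel("g2", 0.10, 250, True, False),   # kept: long complete ORF
    GeneModel("g3", 0.10, 80, True, False),    # pruned: neither criterion met
    GeneModel("g4", 0.90, 300, True, True),    # pruned: bacterial scaffold
]
kept = [g.gene_id for g in models if keep(g)]
print(kept)
```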

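Transcriptome Functional Annotation
[Figure: Venn diagram of evidence for the final gene set: BlastP (77%), InterProScan (66%), RNA-Seq evidence (89%); region counts as extracted: 1,270, 74, 453, 14,066, 2,605, 912, 3,238; outside all three sets: 877 (3.7%)]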
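The Venn diagram's layout is lost, but the region counts can be checked against the stated totals. The assignment of counts to regions below is an assumption, chosen because it reproduces the 77%, 66%, and 89% set totals, the 23,495-gene final set, and the 19,380 genes with a BLASTP and/or InterProScan prediction; it is not taken from the original figure.

```python
TOTAL_GENES = 23_495  # final gene set size from the results slide

# Assumed assignment of the extracted counts to Venn regions
regions = {
    frozenset({"blastp", "interpro", "rnaseq"}): 14_066,
    frozenset({"blastp", "rnaseq"}): 2_605,
    frozenset({"interpro", "rnaseq"}): 912,
    frozenset({"blastp", "interpro"}): 74,
    frozenset({"blastp"}): 1_270,
    frozenset({"interpro"}): 453,
    frozenset({"rnaseq"}): 3_238,
    frozenset(): 877,  # no BlastP hit, no InterProScan hit, no RNA-seq evidence
}


def set_total(name: str) -> int:
    """Number of genes in the named evidence set, summed over its regions."""
    return sum(n for members, n in regions.items() if name in members)


def percent(n: int) -> float:
    return 100.0 * n / TOTAL_GENES


functional = sum(n for members, n in regions.items()
                 if "blastp" in members or "interpro" in members)

for name in ("blastp", "interpro", "rnaseq"):
    print(f"{name}: {set_total(name):,} genes ({percent(set_total(name)):.1f}%)")
print(f"BLASTP and/or InterProScan: {functional:,}")
```

Other assignments of the seven counts do not appear to match all of the stated totals at once, but the sketch should still be read as a consistency check rather than a reconstruction of the figure.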

Transcriptome Annotation
[Figure: annotations by species: Brachypodium distachyon, Sorghum bicolor, Cicer arietinum, Setaria italica, Solanum lycopersicum, Fragaria vesca, Zea mays, Cucumis sativus, Ricinus communis, Glycine max, Prunus persica, Populus trichocarpa, Oryza sativa, Theobroma cacao, Vitis vinifera]

Alternative Splicing
[Figure: number of genes (log10 scale) by number of isoforms (1 to 14)]

Identify Differentially Expressed Genes
[Figure: number of 101 bp RNA-seq reads (millions) per accession and treatment: 9509 Control, 9509 ABA, 9316 Control, 9316 ABA, 9501 Control, 9501 ABA]

Differentially Expressed Genes
  1,727 genes identified as significantly differentially expressed
  1,105 isoforms identified as significant
  Molecular verification in progress

Ongoing
  Assemble repetitive elements
  Assemble, annotate mitochondria, chloroplast
  Accessions, structural differences
  Molecular investigation of differentially expressed genes of interest

Thank you