COMPUTATIONAL PREDICTION AND CHARACTERIZATION OF A TRANSCRIPTOME USING CASSAVA (MANIHOT ESCULENTA) RNA-SEQ DATA

1 COMPUTATIONAL PREDICTION AND CHARACTERIZATION OF A TRANSCRIPTOME USING CASSAVA (MANIHOT ESCULENTA) RNA-SEQ DATA
AOBAKWE MATSHIDISO, SCOTT HAZELHURST, CHRISSIE REY
(1) Wits Bioinformatics, University of the Witwatersrand, Johannesburg
(2) Department of Plant Biotechnology, School of Molecular and Cell Biology, University of the Witwatersrand, Johannesburg

3 A TRANSCRIPTOME
A collection of transcripts in a cell, organ, organism, etc.
[Figure: the transcriptome and its RNA classes: rRNA, siRNA, mRNA, tRNA, stRNA, miRNA, snRNA]

4 IMPORTANCE OF A TRANSCRIPTOME
- Understanding of genetic variants
- Decipher the complexity of the genome
- Explore genetic activity and function
- Gain insights into biological pathways, cellular mechanisms and interactions
- Identify splicing events and allelic expression patterns
[Figure: DNA, pre-RNA, mRNA, proteins]

5 OBJECTIVES
- Realise an efficient analysis pipeline to characterise a plant transcriptome
- Apply the pipeline to characterise a plant transcriptome and gain insight into polymorphisms, variations and biological phenomena

6 DATASET
- Cassava, tapioca, yuca, manioc, majumbura, madumbe (Manihot esculenta Crantz)
- Two cultivars, resistant (T200) and susceptible (TME3) to cassava mosaic disease (SACMD)
- Total RNA from leaf tissue at three time points: 12, 32 and 67 days post inoculation (dpi)
- Sequenced on the Applied Biosystems SOLiD 4 System

7 THE IMPORTANCE OF CASSAVA
- Staple food for over 850 million people
- Used for bread, cake, beer, chips, beverages, etc.; also as a wheat flour and potato replacement
- Industrial uses: biofuels, paper, etc.
- More carbohydrates per hectare than most crops; low-maintenance crop
- Problem: severely affected by the SA Cassava Mosaic Virus

8 CHALLENGES
- Cassava polyploidy (2n = 36)
- Genome is ~30-40% annotated
- Colour-space sequenced reads
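
The colour-space challenge can be illustrated with a minimal Python sketch of SOLiD two-base decoding. It assumes a read is written as one primer base followed by colour digits 0-3, with '.' marking a missing colour call; the function name `decode_colour_space` is hypothetical and is not part of any tool named in these slides. Because each base is derived from the previous one, a single wrong colour call would alter every base after it, which is one reason colour-space-aware aligners are usually preferred over naive translation.

```python
# Colour-space transition table: a colour encodes a pair of adjacent bases,
# so a known base plus a colour (0-3) determines the next base.
_NEXT_BASE = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}


def decode_colour_space(read: str) -> str:
    """Translate a colour-space read (primer base + digits 0-3) into bases.

    The primer base itself is not part of the returned sequence.
    Decoding stops at the first character that is not a valid colour,
    because nothing after a missing call can be resolved.
    """
    if not read:
        return ""
    base = read[0].upper()
    if base not in _NEXT_BASE:
        raise ValueError(f"unexpected primer base: {read[0]!r}")
    bases = []
    for colour in read[1:]:
        if colour not in ("0", "1", "2", "3"):
            break  # e.g. '.' for a missing colour call
        base = _NEXT_BASE[base][int(colour)]
        bases.append(base)
    return "".join(bases)


if __name__ == "__main__":
    print(decode_colour_space("T0123"))  # TGAT
```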

10 HIGH PERFORMANCE COMPUTING PLATFORM
- Ubuntu Lucid Lynx virtual machine
- 12 processor cores
- 72 GB RAM
- 100-core cluster

11 THE ANALYSIS PIPELINE
- Sequence Reads: SOLiD 4 System
- Read Enhancement: SAET
- Adaptor Removal: CUTADAPT
- De Novo Assembly: Velvet, ABySS, Kanga
- Pre-processing: PRINSEQ
- Alignment to Reference: BOWTIE, Novo, SOCS, BWA
- Splice Junction Mapping: TOPHAT
- Differential Expression: CUFFLINKS
- Downstream Analysis: BLAST, KEGG, GO, Uniprot, etc.
- Visualization: UCSC Genome Browser, IGV

12 QUALITY CONTROL
Library  Tag    Number of Reads
EA       F3     58,133,361
EA       F5-BC  58,133,361
ED       F3     41,441,387
ED       F5-BC  41,441,387
F1       F3     33,000,225
F1       F5-BC  33,000,225
F4       F3     25,591,844
F4       F5-BC  25,591,844
EC       F3     45,308,104
EC       F5-BC  45,308,104
EF       F3     39,445,038
EF       F5-BC  39,445,038
F3       F3     33,539,947
F3       F5-BC  33,539,947
F6       F3     33,335,491
F6       F5-BC  33,335,491
Average length of read and %GC Content: row labels only; no values in the transcript.
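
The three quality-control metrics on this slide (number of reads, average read length, %GC content) can be computed as in the sketch below. It is not the output of PRINSEQ or any other tool in the pipeline; it assumes the reads are already available as base-space strings, and the function name `read_stats` is hypothetical.

```python
def read_stats(sequences):
    """Return (number of reads, average read length, %GC content).

    `sequences` is any iterable of base-space read strings.
    %GC is computed over all bases pooled across reads.
    """
    n_reads = 0
    total_bases = 0
    gc_bases = 0
    for seq in sequences:
        seq = seq.upper()
        n_reads += 1
        total_bases += len(seq)
        gc_bases += seq.count("G") + seq.count("C")
    if total_bases == 0:
        # No bases at all: avoid dividing by zero.
        return n_reads, 0.0, 0.0
    average_length = total_bases / n_reads
    gc_percent = 100.0 * gc_bases / total_bases
    return n_reads, average_length, gc_percent


if __name__ == "__main__":
    n, mean_len, gc = read_stats(["ACGT", "GGCC", "AT"])
    print(n, round(mean_len, 2), round(gc, 1))
```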

13 MAPPING QUALITY
[Figure panels, each plotted against dpi: Bowtie, Enhanced with Default settings; Bowtie, Enhanced with Stricter settings; NovoAlign, Raw with Default settings]
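
One way to compare aligner runs like these is to summarise each run's SAM output, as in the sketch below. The slides do not say how mapping quality was measured, so this is an assumption: it counts primary records, how many are mapped (FLAG bit 0x4 unset), and how many mapped records reach a chosen MAPQ threshold. The function name `mapping_summary` and the `min_mapq` parameter are hypothetical.

```python
def mapping_summary(sam_lines, min_mapq=0):
    """Summarise alignments from an iterable of SAM text lines.

    Returns a dict with the number of primary records ('total'),
    mapped records ('mapped'), and mapped records whose MAPQ is at
    least `min_mapq` ('passing').
    """
    total = mapped = passing = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: MAPQ
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) record
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        if mapq >= min_mapq:
            passing += 1
    return {"total": total, "mapped": mapped, "passing": passing}
```

Calling it once per aligner and setting, with the same `min_mapq`, gives counts that can be compared directly across runs.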

14 SEQUENCE ALIGNMENT (BLAST) NR-nucleotide database with BLASTX, e-value cutoff 1e-5 (Resistant 12 dpi)
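
The e-value cutoff can be applied to BLAST tabular output as in the sketch below. It assumes the standard 12-column tabular layout, where the e-value is the 11th column, and it treats the cutoff as inclusive; neither detail is stated on the slide. The function name `filter_blast_hits` is hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-5):
    """Yield (query_id, subject_id, evalue) for hits at or below max_evalue.

    `lines` is an iterable of tab-separated BLAST tabular records;
    blank lines and '#' comment lines are skipped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])  # 11th column of the standard layout
        if evalue <= max_evalue:
            yield fields[0], fields[1], evalue
```

Because it is a generator, it can be fed an open file handle directly without loading the whole result table into memory.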

15 KEGG GENE Resistant 12 dpi

16 KEGG ORGANISMS Resistant 12 dpi

17 UNIPROT PROTEIN PREDICTIONS Resistant 12 dpi

18 GENE ONTOLOGIES (Resistant 67 dpi)
Molecular Function: antioxidant activity; auxiliary transport protein activity; binding; catalytic activity; chaperone regulator activity; chemoattractant activity; chemorepellent activity; enzyme regulator activity; metallochaperone activity; molecular transducer activity; motor activity; nutrient reservoir activity; protein tag; structural molecule activity; transcription regulator activity; translation regulator activity; transporter activity
Biological Process: biological adhesion; biological regulation; cell killing; cellular process; developmental process; establishment of localization; gene expression; growth; immune system process; localization; locomotion; metabolic process; multi-organism process; multicellular organismal process; pigmentation; reproduction; response to stimulus; viral reproduction
Cellular Component: cell; cell part; envelope; extracellular matrix; extracellular matrix part; extracellular region; extracellular region part; macromolecular complex; membrane-enclosed lumen; organelle; organelle part; symplast; synapse; synapse part; virion; virion part

19 GENE MODELS Susceptible 32 dpi, A. thaliana chr5:19,948,455-20,000,800

20 QUANTILE NORMALIZATION
[Figure panels: Resistant 12 dpi versus Susceptible 67 dpi; Resistant 67 dpi versus Susceptible 67 dpi; Resistant 67 dpi versus Susceptible 67 dpi]
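
Quantile normalization forces every sample to share the same distribution of values, so that samples can be compared directly. The sketch below shows the textbook procedure in plain Python. It assumes one list of expression values per sample, all the same length and in the same gene order; the slides do not state which values or which implementation were used. The function name `quantile_normalize` is hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists of values.

    Each value is replaced by the mean, across samples, of the values
    that share its rank. Ties within a sample are broken by position
    (a simplification; tied values are often averaged instead).
    """
    if not columns:
        return []
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all samples must have the same number of values")

    # Reference distribution: mean of the k-th smallest value over samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns)
        for k in range(n_rows)
    ]

    normalized = []
    for col in columns:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every sample contains the same set of values, only in a different gene order, which is why the samples' distributions coincide.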

21 NAÏVE EXPRESSION ANALYSIS
[Figure panels, axis: Gene Expression Levels: Resistant 12 dpi versus Susceptible 67 dpi; Resistant 67 dpi versus Susceptible 67 dpi; Resistant 67 dpi versus Susceptible 67 dpi]

22 WORK IN PROGRESS
- Improving alignments and assemblies
- Scaffolding and finishing
- Alignment and mapping to other genomes
- Inferring orthology, paralogy and synteny
- Pathway analyses
- Validation with qPCR

23 ACKNOWLEDGEMENTS THANK YOU