COMPUTATIONAL PREDICTION AND CHARACTERIZATION OF A TRANSCRIPTOME USING CASSAVA (MANIHOT ESCULENTA) RNA-SEQ DATA

AOBAKWE MATSHIDISO, SCOTT HAZELHURST, CHRISSIE REY
1 Wits Bioinformatics, University of the Witwatersrand, Johannesburg
2 Department of Plant Biotechnology, School of Molecular and Cell Biology, University of the Witwatersrand, Johannesburg

A TRANSCRIPTOME
A collection of transcripts in a cell, organ, organism, etc.
Transcript types shown in the diagram: rRNA, siRNA, mRNA, tRNA, stRNA, miRNA, snRNA

IMPORTANCE OF A TRANSCRIPTOME
Understand genetic variants
Decipher the complexity of the genome
Explore genetic activity and function
Gain insights into biological pathways, cellular mechanisms and interactions
Identify splicing events and allelic expression patterns
Diagram: DNA -> pre-RNA -> mRNA -> proteins

OBJECTIVES
Develop an efficient analysis pipeline for characterizing a plant transcriptome
Apply the pipeline to a plant transcriptome to gain insight into polymorphisms, variation and biological phenomena

DATASET
Cassava, also known as tapioca, yuca, manioc, majumbura or madumbe (Manihot esculenta Crantz)
Two cultivars: resistant (T200) and susceptible (TME3) to cassava mosaic disease (SACMD)
Total RNA from leaf tissue
Three timepoints: 12, 32 and 67 days post inoculation (dpi)
Sequenced on the Applied Biosystems SOLiD 4 System

THE IMPORTANCE OF CASSAVA
Staple food for over 850 million people
Used for bread, cake, beer, chips, beverages, etc.; also as a replacement for wheat flour and potato
Industrial uses: biofuels, paper, etc.
More carbohydrates per hectare than most crops; low-maintenance crop
Problem: severely affected by the SA Cassava Mosaic Virus

CHALLENGES
Cassava polyploidy (2n = 36)
Genome is only ~30-40% annotated
Reads are sequenced in colour space
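To illustrate why colour-space reads are listed as a challenge, here is a minimal sketch of SOLiD two-base decoding. Each colour (0-3) encodes the transition between two adjacent bases, so a read can only be translated by walking from the known primer base. The function names and the example read are hypothetical and are not taken from the slides.

```python
# With bases ordered A, C, G, T, the SOLiD colour of a dinucleotide
# equals the XOR of the two base indices (e.g. AA -> 0, AC -> 1, AG -> 2, AT -> 3).
BASES = "ACGT"


def decode_colour_space(read: str) -> str:
    """Translate a colour-space read (primer base + colour digits) to bases.

    The primer base itself is not part of the returned sequence.
    """
    prev = BASES.index(read[0])
    decoded = []
    for colour in read[1:]:
        if colour not in "0123":
            # Missing colour calls cannot be decoded by this simple walk.
            raise ValueError(f"cannot decode colour call {colour!r}")
        prev ^= int(colour)          # apply the transition to the previous base
        decoded.append(BASES[prev])
    return "".join(decoded)


def encode_colour_space(primer: str, sequence: str) -> str:
    """Inverse operation: encode a base sequence as primer + colours."""
    prev = BASES.index(primer)
    encoded = [primer]
    for base in sequence:
        current = BASES.index(base)
        encoded.append(str(prev ^ current))
        prev = current
    return "".join(encoded)


print(decode_colour_space("T0123"))        # hypothetical read
print(encode_colour_space("T", "TGAT"))
```

Because every base depends on all the colours before it, a single miscalled colour changes every downstream base after naive translation. This is one reason colour-space data are usually aligned with tools that work on the colours directly.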

HIGH PERFORMANCE COMPUTING PLATFORM
Ubuntu Lucid Lynx virtual machine
12 processor cores
72 GB RAM
100-core cluster

THE ANALYSIS PIPELINE
Sequence reads: SOLiD 4 System
Read enhancement: SAET
Adaptor removal: Cutadapt
De novo assembly: Velvet, ABySS, Kanga
Pre-processing: PRINSEQ
Alignment to reference: Bowtie, NovoAlign, SOCS, BWA
Splice junction mapping: TopHat
Differential expression: Cufflinks
Downstream analysis: BLAST, KEGG, GO, UniProt, etc.
Visualization: UCSC Genome Browser, IGV

QUALITY CONTROL
Sample and tag       EA F3       EA F5-BC    ED F3       ED F5-BC    F1 F3       F1 F5-BC    F4 F3       F4 F5-BC
Number of reads      58,133,361  58,133,361  41,441,387  41,441,387  33,000,225  33,000,225  25,591,844  25,591,844
Average read length  50          35          50          35          50          35          50          35
%GC content          48          50          49          51          49          53          50          55

Sample and tag       EC F3       EC F5-BC    EF F3       EF F5-BC    F3 F3       F3 F5-BC    F6 F3       F6 F5-BC
Number of reads      45,308,104  45,308,104  39,445,038  39,445,038  33,539,947  33,539,947  33,335,491  33,335,491
Average read length  50          35          50          35          50          35          50          35
%GC content          49          53          49          52          48          51          50          52
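The three quality-control metrics reported per library (number of reads, average read length, %GC content) can be computed as in the following sketch. It assumes the reads are already available as base-space strings; the slides do not say which tool produced the figures, and the function name and example reads are hypothetical.

```python
def read_stats(reads):
    """Return read count, mean read length and %GC for a list of base-space reads."""
    if not reads:
        raise ValueError("no reads supplied")
    total_bases = sum(len(r) for r in reads)
    if total_bases == 0:
        raise ValueError("all reads are empty")
    # Count G and C case-insensitively across every read.
    gc_bases = sum(r.upper().count("G") + r.upper().count("C") for r in reads)
    return {
        "reads": len(reads),
        "mean_length": total_bases / len(reads),
        "gc_percent": 100.0 * gc_bases / total_bases,
    }


# Hypothetical toy input, not data from the study.
stats = read_stats(["ACGT", "GGCC", "AATT"])
print(stats)
```

For real libraries of tens of millions of reads, the same sums would be accumulated while streaming the file rather than holding all reads in memory.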

MAPPING QUALITY
Bowtie, enhanced reads, default settings (control @ 12 dpi)
Bowtie, enhanced reads, stricter settings (control @ 12 dpi)
NovoAlign, raw reads, default settings (control @ 12 dpi)

SEQUENCE ALIGNMENT (BLAST)
NR-nucleotide database with BLASTX, e-value threshold 1e-5
Resistant Control @ 12 dpi
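As a sketch of how an e-value threshold of 1e-5 might be applied downstream, the code below filters BLAST hits and keeps the best hit per query. It assumes the hits were saved in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the slides do not state the output format used, and the sample rows and identifiers are hypothetical.

```python
import csv
import io


def best_hits(handle, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for hits passing the threshold."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical tabular hits, not results from the study.
sample = io.StringIO(
    "contig1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-30\t200\n"
    "contig1\tprotB\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-10\t120\n"
    "contig2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30\n"
)
hits = best_hits(sample)
print(hits)
```

Here contig2 is dropped because its only hit has an e-value above the threshold, and contig1 keeps the higher-scoring of its two passing hits.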

KEGG GENE Resistant Control @ 12 dpi

KEGG ORGANISMS Resistant Control @ 12 dpi

UNIPROT PROTEIN PREDICTIONS Resistant Control @ 12 dpi

GENE ONTOLOGIES
Resistant Infected @ 67 dpi

Molecular Function
  GO:0016209 antioxidant activity
  GO:0015457 auxiliary transport protein activity
  GO:0005488 binding
  GO:0003824 catalytic activity
  GO:0030188 chaperone regulator activity
  GO:0042056 chemoattractant activity
  GO:0045499 chemorepellent activity
  GO:0030234 enzyme regulator activity
  GO:0016530 metallochaperone activity
  GO:0060089 molecular transducer activity
  GO:0003774 motor activity
  GO:0045735 nutrient reservoir activity
  GO:0031386 protein tag
  GO:0005198 structural molecule activity
  GO:0030528 transcription regulator activity
  GO:0045182 translation regulator activity
  GO:0005215 transporter activity

Biological Process
  GO:0022610 biological adhesion
  GO:0065007 biological regulation
  GO:0001906 cell killing
  GO:0009987 cellular process
  GO:0032502 developmental process
  GO:0051234 establishment of localization
  GO:0010467 gene expression
  GO:0040007 growth
  GO:0002376 immune system process
  GO:0051179 localization
  GO:0040011 locomotion
  GO:0008152 metabolic process
  GO:0051704 multi-organism process
  GO:0032501 multicellular organismal process
  GO:0043473 pigmentation
  GO:0000003 reproduction
  GO:0050896 response to stimulus
  GO:0016032 viral reproduction

Cellular Component
  GO:0005623 cell
  GO:0044464 cell part
  GO:0031975 envelope
  GO:0031012 extracellular matrix
  GO:0044420 extracellular matrix part
  GO:0005576 extracellular region
  GO:0044421 extracellular region part
  GO:0032991 macromolecular complex
  GO:0031974 membrane-enclosed lumen
  GO:0043226 organelle
  GO:0044422 organelle part
  GO:0055044 symplast
  GO:0045202 synapse
  GO:0044456 synapse part
  GO:0019012 virion
  GO:0044423 virion part

GENE MODELS
Susceptible Control @ 32 dpi
A. thaliana: chr5:19,948,455-20,000,800

QUANTILE NORMALIZATION
Resistant Control @ 12 dpi versus Susceptible Infected @ 67 dpi
Resistant Control @ 67 dpi versus Susceptible Infected @ 67 dpi
Resistant Infected @ 67 dpi versus Susceptible Infected @ 67 dpi
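For readers unfamiliar with the technique, quantile normalization forces every sample to share the same distribution of expression values so that samples can be compared directly. The sketch below is a minimal pure-Python version, assuming each sample is a list of expression values in the same gene order. The slides do not say which software performed this step, the input values are hypothetical, and ties are broken by position rather than averaged.

```python
def quantile_normalize(samples):
    """Quantile-normalize {sample_name: [values per gene]} across samples."""
    names = list(samples)
    n_genes = len(samples[names[0]])
    if any(len(samples[s]) != n_genes for s in names):
        raise ValueError("all samples must cover the same genes")

    # order[s][k] is the index of the gene holding rank k in sample s.
    order = {
        s: sorted(range(n_genes), key=lambda i: samples[s][i]) for s in names
    }
    # Mean of the k-th smallest value across all samples.
    rank_means = [
        sum(samples[s][order[s][k]] for s in names) / len(names)
        for k in range(n_genes)
    ]
    # Replace each value by the mean for its rank.
    normalized = {s: [0.0] * n_genes for s in names}
    for s in names:
        for rank, gene_idx in enumerate(order[s]):
            normalized[s][gene_idx] = rank_means[rank]
    return normalized


# Hypothetical expression values for three samples and four genes.
raw = {
    "A": [5.0, 2.0, 3.0, 4.0],
    "B": [4.0, 1.0, 4.0, 2.0],
    "C": [3.0, 4.0, 6.0, 8.0],
}
norm = quantile_normalize(raw)
print(norm)
```

After normalization each sample contains exactly the same set of values, and only the assignment of values to genes differs between samples.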

NAÏVE EXPRESSION ANALYSIS
Axis label: Gene Expression Levels
Resistant Control @ 12 dpi versus Susceptible Infected @ 67 dpi
Resistant Control @ 67 dpi versus Susceptible Infected @ 67 dpi
Resistant Infected @ 67 dpi versus Susceptible Infected @ 67 dpi

WORK IN PROGRESS
Improving alignments and assemblies
Scaffolding and finishing
Alignment and mapping to other genomes
Inferring orthology, paralogy and synteny
Pathway analyses
Validation with qPCR

ACKNOWLEDGEMENTS

THANK YOU