Combined final report: genome and transcriptome assemblies


Nadia Fernandez - Trinity assembly, RSEM, Tophat and Cufflinks/Cuffmerge/Cuffdiff pipeline, and MAKER annotation.

Stephanie Gutierrez

Avril Harder - Some QC steps for reads/transcripts, SOAPdenovo2, HTSeq, DESeq2/edgeR for reference- and de novo-assembled transcriptomes.

Samarth Mathur - Generated genome assemblies (using ABySS and SOAPdenovo) from trimmed, cleaned reads and from overlapping reads merged with FLASH; merged genome assemblies using GAM-NGS; QUAST statistics for the generated assemblies; QC/contamination cleaning for transcripts; analysed Cuffdiff RNA-Seq differential expression data using the cummeRbund package; expression analysis for 3 chosen DEGs and hypothesis formation.

Alex Martinez - After DGE analyses were completed, I was responsible for examining our list of differentially expressed genes along with two other group members and choosing 3 genes of interest pertaining to M. roreri mating type. Once our 3 genes were chosen, I was responsible for researching biological pathways and functions in which our genes may be involved. Finally, I was responsible for developing a story detailing potential roles our genes might serve in regard to mating type and reproduction in M. roreri.

Genome Assembly

Adapters and low-quality sequences were removed from raw mate pair reads using Trimmomatic, and contaminant sequences (M. roreri mitochondrial genome and PhiX sequences) were removed using Bowtie 2. Read quality was checked before and after running Trimmomatic (e.g., Fig. 1). Proportions of reads surviving each quality control step are outlined in Table 1. Tadpole was used to error-correct paired-end reads shared by another group, with full corrections applied to 23,898,406 reads and partial corrections applied to 862,524 reads.

Table 1. Number of mate pair reads at each quality control step.
Columns: MP_1 paired; MP_1 unpaired; MP_2 paired; MP_2 unpaired
Initial # reads with Nextera adapters (% initial): (22.85%); (24.44%); --
Initial # reads:
Remaining # reads following Bowtie2 contaminant (PhiX and mito. genome) removal (% initial): (63.19%); (20.66%); (63.19%); (14.36%)
Remaining # reads following Trimmomatic cleaning (% initial): (63.42%); (20.73%); (63.42%); (14.41%)
Remaining # reads with Nextera adapters following Trimmomatic cleaning:

Figure 1. FastQC plots for MP_1 reads (a) before and (b) after removal of adapter and low-quality sequences.

Merging paired-end reads using FLASH

Paired-end and mate pair reads were merged using FLASH (Fast Length Adjustment of SHort reads) to obtain extended fragments (hereafter, flash reads).

Parameters used:
    Min overlap: 10
    Max overlap: 65
    Max mismatch density:
    Allow "outie" pairs: false
    Cap mismatch quals: false
    Combiner threads: 10
    Input format: FASTQ, phred_offset=33
    Output format: FASTQ, phred_offset=33

Read combination statistics:
    Columns: Reads; Total pairs; Combined pairs; Uncombined pairs; Percent combined
    Paired End Reads: %
    Mate Pair Reads: %

The final output consists of merged reads as extended fragments (single-end reads) and uncombined reads (R1 and R2).

K-mer size estimation

The optimal k-mer size to use for genome assembly was identified using KmerGenie. Final KmerGenie predictions (only paired reads) are:

    Reads        Predicted best k   Predicted assembly size
    Raw Reads    88                 59,588,865 bp
    Flash Reads  90                 59,651,847 bp

For genome assembly using ABySS, a k-mer size of 88 was used for raw reads and 90 for flash reads. For genome assembly using SOAPdenovo, a k-mer size of 88 was used for raw reads and 89 for flash reads.

Genome assembly using ABySS

Trimmed and cleaned paired-end and mate pair reads (raw reads) were assembled using ABySS with a k-mer size of 88:

    abyss-pe name=raw_kmer88 k=88 lib='pe1' mp='mp1' \
        pe1='./pe/phix.mito.unmap.1.fastq ./pe/phix.mito.unmap.2.fastq' \
        mp1='./mp/cleaned_mate-pair_reads.1.fastq ./mp/cleaned_mate-pair_reads.2.fastq'

Merged overlapping reads (FLASH extended fragments, as single-end reads) and uncombined reads (as paired-end reads) were assembled using ABySS with a k-mer size of 90:
    abyss-pe name=flash_kmer90 k=90 lib='pe1' mp='mp1' \
        pe1='./flash/pe/pe.out.notcombined_1.fastq ./flash/pe/pe.out.notcombined_2.fastq' \
        mp1='./flash/mp/mp.out.notcombined_1.fastq ./flash/mp/mp.out.notcombined_2.fastq' \
        se='./flash/pe/pe.out.extendedfrags.fastq ./flash/mp/mp.out.extendedfrags.fastq'

SOAPdenovo2

Cleaned and corrected paired-end reads and cleaned mate pair reads were used to construct a de novo genome assembly with SOAPdenovo2 and an estimated genome size of 50 Mb. The .config file used to run SOAPdenovo2 is available as an attachment to this page (m_roreri_soapdenovo2.config.txt).

QUAST was run with the --scaffolds option to assess the quality of the SOAPdenovo2 assembly. With this option, QUAST produces two sets of summary statistics: (1) for the provided file of scaffolds and (2) for scaffolds resulting from QUAST breaking provided scaffolds after 10 consecutive Ns. When QUAST broke provided scaffolds according to this rule, the number of Ns per 100 Kb in the assembly decreased from to . The total length of the broken assembly was Mb, with contigs, an N50 of 14,883, and 139 Kb in the largest contig. Prior to breaking scaffolds, the total assembly length was 55.7 Mb, with an N50 of 107,380 and 1.78 Mb in the largest contig.

REAPR was also used to check assembly accuracy (reapr.sh). Only 2,365 of 63,281,957 bases were found to be error-free using the perfectmap approach. The FCD rate plot (Fig. 2a) and read coverage plot (Fig. 2b) are below.

Figure 2. (a) FCD rate and (b) read coverage plots provided by REAPR analysis of the SOAPdenovo2 assembly.

Merging assemblies using GAM-NGS

GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing) is used to improve de novo assemblies by merging two assemblies (assembly reconciliation) in order to enhance contiguity and possibly correctness. The two assemblies being merged are placed in a hierarchical order, with one elected as the master and the other as the slave. In situations where weights/features do not allow us to take a position (e.g. similar weights), we decided to be as conservative as possible, trusting only contigs belonging to the master assembly.

GAM-NGS is a multistep process (the entire script file can be found here: gamngs_raw.txt) which involves the following steps:

1. Input preparation: for each assembly and for each read library, GAM-NGS needs a file that lists the BAM files of the aligned libraries.
2. Block construction: blocks are created, with a minimum-reads setting that specifies the number of reads required to build a block.
3. Merging: the master and slave assemblies are merged using the blocks constructed in the previous step.
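The effect of the scaffold-breaking rule on N50, as seen in the QUAST statistics reported in this section, can be illustrated with a short Python sketch. This is not QUAST's implementation: the function names `break_scaffold` and `n50` and the toy sequences are hypothetical. The only behavior assumed is the rule stated above (split wherever 10 or more consecutive Ns occur) and the usual N50 definition (the length L such that sequences of length at least L contain at least half of all assembled bases).

```python
import re


def break_scaffold(seq, min_n_run=10):
    """Split a scaffold into contigs at runs of >= min_n_run consecutive Ns."""
    pieces = re.split("N{%d,}" % min_n_run, seq.upper())
    # Drop empty strings produced by leading or trailing N runs.
    return [p for p in pieces if p]


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy scaffolds (hypothetical): the first has a 12-N gap, the second a 3-N gap.
scaffolds = ["A" * 50 + "N" * 12 + "C" * 30, "G" * 20 + "N" * 3 + "T" * 20]
contigs = [c for s in scaffolds for c in break_scaffold(s)]

scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for c in contigs])
print(scaffold_n50, contig_n50)
```

In the toy data only the 12-N gap is long enough to trigger a break, so the first scaffold becomes two contigs and N50 drops. This is the same direction of change reported for the SOAPdenovo2 assembly (N50 of 107,380 before breaking versus 14,883 after).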
For our analysis, we created a merged assembly with the ABySS assembly as the master assembly and the SOAPdenovo assembly as the slave assembly.

QUAST results for the merged GAM-NGS assembly (from broken scaffolds): total length Mb, with 4,787 contigs, an N50 of 36,036, and 290 Kb in the largest contig. Prior to breaking scaffolds, the total assembly length was 57.9 Mb, with an N50 of 102,258 and 926 Kb in the largest contig.

Transcriptome Assembly

Data Quality Control

Trimmomatic

Trimmomatic was run for each individual with the following parameters for adapter removal: paired-end; phred33; ILLUMINACLIP:/group/bioinfo/apps/apps/trimmomatic-0.32/adapters/TruSeq3-PE-2.fa:2:20:9; LEADING:7 TRAILING:7; SLIDINGWINDOW:4:13; MINLEN:30.

    Individual  Input Read Pairs  Both Surviving  Forward Only Surviving  Reverse Only Surviving  Dropped
    JD-6        35,221,           (98.02%)        (1.76%)                 (0.17%)                 (0.04%)
    JD-8        33,453,           (98.16%)        (1.64%)                 (0.17%)                 (0.04%)
    JD-5        37,640,           (98.15%)        (1.64%)                 (0.18%)                 (0.04%)
    MCA         ,143,             (98.13%)        (1.66%)                 (0.17%)                 (0.04%)
    MCA         ,156,             (97.76%)        (2.04%)                 (0.16%)                 (0.05%)
    MCA         ,396,             (97.99%)        (1.80%)                 (0.17%)                 (0.04%)

Table 3. Number of reads surviving after adapter removal via Trimmomatic, and their percentages.

Bowtie

For contaminant removal, the mitogenome and PhiX fasta files were downloaded from the wiki page. These two fasta files were merged together and indexed via bowtie-build. Trimmed reads were mapped against the merged contaminant fasta file. Reads that did not align to the contaminant fasta file were treated as clean reads and written to a new fastq file; reads that mapped to contaminants were written to a sam file. Example:

    bowtie -t -S --un JD-8_trimmomatic_forward_filtered.fastq \
        merged_contaminants.fasta \
        JD-8_trimmomatic_forward_paired.fastq \
        JD-8_forward_contaminant_alignments.sam

    bowtie -t -S --un JD-8_trimmomatic_reverse_filtered.fastq \
        merged_contaminants.fasta \
        JD-8_trimmomatic_reverse_paired.fastq \
        JD-8_reverse_contaminant_alignments.sam

FastQC

After adapter and contaminant removal, FastQC was used to quantify the quality of the reads.

Figure 3. FastQC plots for JD-5 forward reads (a) before and (b) after removal of adapter and low-quality sequences.

De novo Transcriptome Assembly

The transcriptome was assembled with Trinity (v2.2.0) using the newly cleaned reads. The forward and reverse files of all individuals were input into the Trinity run. Simplified script:

    Trinity --seqType fq --max_memory 96G --CPU 20 --verbose \
        --left JD-8_trimmomatic_forward_filtered.fastq,\
    JD-6_trimmomatic_forward_filtered.fastq,\
    JD-5_trimmomatic_forward_filtered.fastq,\
    MCA-2504_trimmomatic_forward_filtered.fastq,\
    MCA-2952_trimmomatic_forward_filtered.fastq,\
    MCA-2974_trimmomatic_forward_filtered.fastq \
        --right JD-8_trimmomatic_reverse_filtered.fastq,\
    JD-6_trimmomatic_reverse_filtered.fastq,\
    JD-5_trimmomatic_reverse_filtered.fastq,\
    MCA-2504_trimmomatic_reverse_filtered.fastq,\
    MCA-2952_trimmomatic_reverse_filtered.fastq,\
    MCA-2974_trimmomatic_reverse_filtered.fastq \
        &> trinity_log.txt

Trinity stats on Trinity.fasta

Counts of transcripts, etc.
    Total trinity 'genes':
    Total trinity transcripts:
    Percent GC:

Stats based on ALL transcript contigs:
    Contig N10:
    Contig N20: 8868
    Contig N30: 7097
    Contig N40: 5735
    Contig N50: 4654
    Median contig length: 1774
    Average contig:
    Total assembled bases:

Stats based on ONLY LONGEST ISOFORM per 'GENE':
    Contig N10: 9477
    Contig N20: 7060
    Contig N30: 5559
    Contig N40: 4449
    Contig N50: 3604
    Median contig length: 885
    Average contig:
    Total assembled bases:

RSEM

Prepared reference:

    rsem-prepare-reference \
        --num-threads 20 \
        --transcript-to-gene-map gene.map \
        --bowtie2 \
        trinity_out_dir/trinity.fasta Trinity_ref

Calculated expression (for each individual), example:

    rsem-calculate-expression -p 20 --bowtie2 --paired-end \
        cleaned_reads/bowtie/repaired_reads/jd-6_forward_filtered_fixed.fastq \
        cleaned_reads/bowtie/repaired_reads/jd-6_reverse_filtered_fixed.fastq \
        Trinity_ref JD-6.rsem

Reference Genome Analysis

Reference genome: GCF_ _M_roreri_MCA_2997_v1_genomic.fna
Reference annotation: GCF_ _M_roreri_MCA_2997_v1_genomic.gff

Tophat

Bowtie was used to build an index of the reference genome, and Tophat was used to generate BAM files for each individual. The Tophat run initially failed with an error caused by mismatched read pairs, so we ran BBMap's "repair.sh" to correct for mismatched or missing reads. This can happen when the forward and reverse files contain unequal numbers of reads and/or when a read-trimming tool throws away one read in a pair but not the other.

    Strain  Input total  Aligned pairs  Overall mapping rate
    JD5     73,862,560   31,381,        %
    JD6     69,023,838   28,939,        %
    JD8     65,649,570   27,724,        %
    MCA     ,945,066     25,383,        %
    MCA     ,898,998     14,487,        %
    MCA     ,224,452     23,234,        %

Cufflinks

    cufflinks -p 20 --multi-read-correct --compatible-hits-norm \
        -o cufflinks_out/jd-6 \
        -G GCF_ _M_roreri_MCA_2997_v1_genomic.gff \
        JD-6_repaired/accepted_hits.bam

HTSeq

    htseq-count --quiet \
        --format=bam \
        --stranded=no \
        JD5_accepted_hits.bam \
        JD5_transcripts.gtf \
        >JD5.count

Cuffmerge

assemblies_file.txt contained the paths to each transcripts.gtf file produced for each individual run.

    cuffmerge -p 20 \
        -o cuffmerge_out \
        -g GCF_ _M_roreri_MCA_2997_v1_genomic.gff \
        -s GCF_ _M_roreri_MCA_2997_v1_genomic.fna \
        assemblies_file.txt

Cuffdiff

    cuffdiff -o cuffdiff_out -b GCF_ _M_roreri_MCA_2997_v1_genomic.fna -p 20 -L JD-6,JD-8,JD-5,MCA-2504,MCA-2952,MCA u cuffmerge_out/merged.gtf \
        JD-6/accepted_hits.bam \
        JD-8/accepted_hits.bam \
        JD-5/accepted_hits.bam \
        MCA-2504/accepted_hits.bam \
        MCA-2952/accepted_hits.bam \
        MCA-2974/accepted_hits.bam

CummeRbund

cummeRbund is a visualization package for Cufflinks high-throughput sequencing data. It is designed to help navigate through the Cuffdiff RNA-Seq differential expression analysis data. All the following commands are executed in R:

    > setwd("cuffdiff_out")
    > library(cummeRbund)
    > cuff <- readCufflinks()
    > cuff
    CuffSet instance with:
        6 samples
        genes
        isoforms
        TSS
        CDS
        promoters
        splicing
        relCDS
    > disp <- dispersionPlot(genes(cuff))
    > disp

Figure 4. Dispersion plots to estimate overdispersion for each sample as a quality control measure.

    > genes.scv <- fpkmSCVPlot(genes(cuff))
    > isoforms.scv <- fpkmSCVPlot(isoforms(cuff))

Figure 5. Estimating the squared coefficient of variation (CV²) across all (a) genes and (b) isoforms.

    > dens <- csDensity(genes(cuff))
    > dens

Figure 6. Density distributions of FPKM scores across samples.

    > b <- csBoxplot(genes(cuff))
    > b

Figure 7. Boxplots showing log(FPKM) values for each sample.

    > dend <- csDendro(genes(cuff))
    > dend

Figure 8. Dendrogram with 2 branches and 6 members total, at height

Differential Gene Expression Analyses (R)

For de novo transcriptome analysis, transcript counts from RSEM (RSEM.counts.matrix) were imported into DESeq2. For reference-based transcriptome analysis, *.count files produced by HTSeq were imported into DESeq2. DESeq2 was run with "JD" and "MCA" set as conditions in order to compare gene expression between the two mating types (rsem_to_deseq2.r, htseq_to_deseq2.r).

For both the de novo and the reference-based transcriptome, samples within mating types clustered more closely together than samples between mating types (Fig. 9). Samples within mating types were also more closely correlated with one another than samples between mating types (Fig. 10), with smaller distances between samples within mating types (Fig. 11).

Figure 9. For the (a) de novo assembly, mating type accounted for 86% of the variance between samples, and for the (b) reference-based assembly, mating type accounted for 85% of the variance between samples.

Figure 10. Pairwise correlations between samples for the (a) de novo and (b) reference-based transcriptome assemblies.

Figure 11. Distances between samples for the (a) de novo and (b) reference-based transcriptome assemblies.

Volcano plots were constructed to illustrate differentially expressed genes for both assemblies (Fig. 12).

Figure 12. Volcano plots illustrating DEGs for the (a) de novo and (b) reference-based transcriptome assemblies. Genes labeled in red have an adjusted p-value < 1e-10. Genes labeled in green have an adjusted p-value < 1e-10 and exhibited a log2 fold-change greater than 4.

For the de novo assembly, 204 DEGs were identified. For the reference-based assembly, 155 DEGs were identified. For the de novo assembly, DEGs were plotted as a heatmap, demonstrating differences in up- and down-regulation between samples (Fig. 13). DEGs are also listed in a FASTA file: denovo_degs.fasta.
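The labeling rule used in the volcano plots (Fig. 12) can be written out as a small Python sketch. The function name `volcano_class`, the gene names, and the values are hypothetical and are not taken from our DESeq2 output. The sketch also assumes that the fold-change cutoff applies to the absolute log2 fold-change, so that strongly down-regulated genes can be green as well.

```python
def volcano_class(padj, log2_fold_change, padj_cutoff=1e-10, lfc_cutoff=4.0):
    """Classify one gene with the thresholds described for Figure 12."""
    # Genes without an adjusted p-value, or above the cutoff, get no label.
    if padj is None or padj >= padj_cutoff:
        return "unlabeled"
    # Assumption: the cutoff is applied to the absolute log2 fold-change.
    if abs(log2_fold_change) > lfc_cutoff:
        return "green"
    return "red"


# Hypothetical (gene, adjusted p-value, log2 fold-change) records.
records = [
    ("gene_a", 1e-30, 6.2),
    ("gene_b", 1e-15, -1.3),
    ("gene_c", 0.04, 5.0),
    ("gene_d", None, 0.0),  # DESeq2 can report NA for the adjusted p-value
]
labels = {name: volcano_class(padj, lfc) for name, padj, lfc in records}
print(labels)
```

Green genes are a subset of the genes passing the p-value cutoff, so the fold-change test is applied only after the p-value test succeeds.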

Figure 13. Heatmap of DEGs identified in de novo assembly analysis.

DEGs of Interest

DEG#1 Identity
    Ref Gene ID: Moror_3144
    Cufflinks ID: XLOC_
    Description: Hydrophobin 2
    Assoc. GO Terms: GO: :C:fungal-type cell wall; GO: :F:structural constituent of cell wall

Figure 14. Expression levels (FPKM) of hydrophobin-producing gene Moror_3144 between M. roreri mating types.

Description

Moror_3144 is a gene responsible for producing hydrophobin proteins in M. roreri. Hydrophobins are a large family of cysteine-rich proteins that serve as a main component of fungal cell walls. Specifically, hydrophobins help form a hydrophobic sheath on the exterior of fungal spore and hyphal cell walls. Hydrophobins have high surfactant activity, which results from their self-assembly at hydrophilic-hydrophobic interfaces to form an amphipathic monolayer. As a critical component of fungal cell walls, hydrophobins play a key role in fungal interactions with both the external environment and other fungi. Specifically, expression of SC3 hydrophobins is responsible for the production of aerial hyphae and the attachment of hyphae to hydrophobic surfaces in basidiomycete fungi. Individual fungal genomes contain multiple hydrophobin genes, possibly because the genes serve different functional roles or are differentially expressed under different environmental conditions or at different developmental stages.

Hypothesis #1

Our data demonstrated increased expression of Moror_3144 in the MCA mating type. Differential expression of Moror_3144 and other hydrophobin genes could produce structural differences in fungal cell wall composition, rendering mating types incompatible upon initial contact.

Future Experiment #1

Knock out Moror_3144 and other genes responsible for the production of hydrophobin proteins in closely related basidiomycete fungi (e.g. Schizophyllum commune) to see if mating and production of fruiting bodies are altered or inhibited.

Hypothesis #2

SC15 is a secreted protein of 191 a.a. with a hydrophilic N-terminal half and a highly hydrophobic C-terminal half. SC15 is responsible for formation of aerial hyphae and attachment in the absence of the SC3 hydrophobin. Mating types with lower expression of SC3 genes should therefore show increased expression of SC15 protein-producing genes, and expression levels of SC15 should increase when hydrophobins are knocked down.

Future Experiment #2

Silence expression of known SC3 genes and analyze expression levels of SC15 genes in both M. roreri mating types.

In our dataset, we looked at expression levels of known SC3 and SC15 genes in M. roreri.

    Hydrophobin: Moror_3144, Moror_3864, Moror_3142
    SC15: Moror_16098, Moror_9579, Moror_2440, Moror_3141


Figure 15. Expression levels (FPKM) of hydrophobin and SC15 genes between M. roreri mating types.
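The expression levels in the figures above are given in FPKM, which normalizes fragment counts by transcript length and by sequencing depth. A minimal Python sketch of the standard count-based definition follows. The function name `fpkm` and all numbers are hypothetical, and Cufflinks itself estimates abundances with a statistical model rather than this direct formula.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments on a 2 kb transcript in a 25 M fragment library.
value = fpkm(fragments=500, transcript_length_bp=2000, total_mapped_fragments=25_000_000)
print(value)
```

Because both length and library size are divided out, FPKM values for genes of different lengths, or for samples sequenced to different depths, are on a comparable scale.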