Faction 2: Genome Assembly Lab and Preliminary Data

1 Faction 2: Genome Assembly Lab and Preliminary Data [Computational Genomics 2017] Christian Colon, Erisa Sula, David Lu, Tian Jin, Lijiang Long, Rohini Mopuri, Bowen Yang, Saminda Wijeratne, Harrison Kim

2 Outline
- Data used for preliminary results
- Comparison of trimming tools
- Comparison of assemblies
- Assembly assessments and improvement
- Post-assembly goals

3 Objective
Determine the best method to assemble the Salmonella genomes.

4 Workflow
Note: MaSuRCA takes the raw reads directly; do not trim or otherwise preprocess them before running it.

5 Data Used for Preliminary Results
- SP0001 (130 bp): short, good
- SP0004 (230 bp): long, good
- SP0010 (204 bp): long, bad

6 FastQC Results

7 Trimmed with Trimmomatic

8 Trimmed with Prinseq

9 Per Base sequence content

10 Using Prinseq
- Trims adaptors and primers added during library construction
- Allows filtering, re-formatting, and trimming of data
- Provides summary statistics
Command used: prinseq-lite.pl -fastq <read1.fastq> -fastq2 <read2.fastq> -trim_left 12 -trim_right 20
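The effect of the fixed-length options above can be illustrated with a minimal Python sketch. It only mimics what removing 12 bases from the 5' end and 20 bases from the 3' end does to a single read; the function name `trim_fixed` is hypothetical and this is not Prinseq's own code.

```python
def trim_fixed(seq, qual, trim_left=12, trim_right=20):
    """Drop trim_left bases from the 5' end and trim_right bases from the 3' end.

    seq and qual are the sequence and quality strings of one FASTQ record.
    Returns empty strings when the read is too short to survive trimming.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(seq) - trim_right
    if end <= trim_left:
        return "", ""
    return seq[trim_left:end], qual[trim_left:end]


if __name__ == "__main__":
    # Toy read: 12 leading bases, a 30-base core, 20 trailing bases.
    read = "A" * 12 + "CGT" * 10 + "T" * 20
    quals = "I" * len(read)
    trimmed_seq, trimmed_qual = trim_fixed(read, quals)
    print(len(read), "->", len(trimmed_seq))
```

Both strings are sliced with the same indices so that each remaining base keeps its own quality score.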

11 Using Trimmomatic
Usage: java -jar <path to trimmomatic-0.36.jar> PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 HEADCROP:20
- ILLUMINACLIP accounts for adapters.
- LEADING:3 and TRAILING:3 remove low-quality bases and N bases (quality below 3) from the start and end of each read.
- SLIDINGWINDOW:4:15 scans the read with an adjustable window (here 4 bases at a time) and cuts once the average quality in the window falls below the chosen threshold (here 15).
- HEADCROP:20 cuts a fixed number of bases from the 5' end of each read to remove bias. Trimmomatic applies the steps in the order they are listed, so HEADCROP should be listed first if the cut is meant to happen before the quality trimming.
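To make the quality-trimming steps concrete, here is a simplified Python sketch of end trimming and sliding-window trimming on Phred+33 quality strings. It is an illustration only: the function names are hypothetical, and Trimmomatic's own implementation differs in detail (for example in exactly where inside a failing window the read is cut).

```python
def phred33(qual):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]


def trim_ends(seq, qual, leading=3, trailing=3):
    """Remove bases below the given quality from the start and end of a read."""
    scores = phred33(qual)
    start = 0
    while start < len(scores) and scores[start] < leading:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < trailing:
        end -= 1
    return seq[start:end], qual[start:end]


def sliding_window_trim(seq, qual, window=4, min_avg=15):
    """Cut the read at the first window whose mean quality is below min_avg.

    Simplification: the read is cut at the start of the failing window.
    Reads shorter than the window are returned unchanged.
    """
    scores = phred33(qual)
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            return seq[:start], qual[:start]
    return seq, qual


if __name__ == "__main__":
    # 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
    seq, qual = "ACGTACGTACGT", "IIIIIIII####"
    print(sliding_window_trim(seq, qual))
    print(trim_ends("ACGTA", "#III#"))
```

Averaging over a window rather than testing single bases keeps one isolated low-quality base from truncating an otherwise good read.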

12 Trim_Galore
Command used: trim_galore --illumina --clip_R1 <n> --clip_R2 <n> --three_prime_clip_R1 5 --three_prime_clip_R2 5 --length <n> --paired
- Can generate a FastQC report at the end
- Allows individualized trimming at the 5' and 3' ends
- Allows selecting the read source (Illumina) to trim off the adapter sequence
- Allows multiple input files at the same time

13 Pipeline for assembly
Source: "De novo genome assembly: what every biologist should know", Nature Methods

14 SPAdes
- Can use both paired-read and unpaired-read libraries.
- Default k-mer lengths: 21, 33, and 55.
Advantages of SPAdes:
1. Includes a read error correction tool.
2. Fast and user friendly.
3. Removes bulges and tips from the assembly graph (with backtracking).
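SPAdes builds de Bruijn graphs from k-mers at each of the k values it is given. As a rough illustration of what a k-mer length means, the Python sketch below splits reads into overlapping k-mers and counts them. The function name `kmer_counts` is hypothetical, the reads are toy data, and none of this is SPAdes code.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length L yields L - k + 1 k-mers (none if L < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    toy_reads = ["ACGTACGTGACCTGAAGTCCATGCAATGCCTA", "GTGACCTGAAGTCCATGCAATGCCTAGGA"]
    # Toy reads are short, so only the smallest default k (21) is shown here;
    # real runs would also use 33 and 55 on full-length reads.
    counts = kmer_counts(toy_reads, 21)
    print(len(counts), "distinct 21-mers")
```

Smaller k values tolerate low coverage and errors better, while larger k values resolve more repeats, which is one reason several k values are combined.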

15 SPAdes sample1 sample4

16 sample10

17 VelvetOptimiser
- Automatically tunes the hash (k-mer) value, cov_cutoff, and exp_cov to optimize a selected metric (here N50).
- Allows multi-threaded runs.
Command used: ./velvetoptimiser.pl -d ~/Assembly/velvet_contig/SP0001 -s 31 -e 57 -x 2 -f '-fastq.gz -shortpaired -separate /data/home/jlu345/assembly/trimmed_read/sp0001_r1_trimmed.fq.gz /data/home/jlu345/assembly/trimmed_read/sp0001_r2_trimmed.fq.gz' -t 6 --optfunckmer 'n50'
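Since N50 is the metric being optimized here, a short Python sketch of the standard N50 definition may help: it is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the assembly size. The function name `n50` and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20]  # toy lengths, not project data
    print("N50 =", n50(toy_contigs))
```

A higher N50 means more of the assembly sits in long contigs, but it says nothing about correctness, so it is best read alongside other assessments.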

18 Choosing Reference Genome
- 53 reference genomes in total.
- Lengths between 4,700,000 and 4,900,000 bp.
- The reference finally selected is Salmonella enterica subsp. enterica serovar Heidelberg strain SH12-007, complete genome, CP

19 QUAST results for Assembly
Assemblies compared: Velvet (SP1, SP2, SP4, SP5, SP10); SPAdes (SP1, SP2, SP4, SP5, SP6, SP7, SP10)
Metrics reported: N, Coverage, Total Length, GC Content, Num of Contigs, Assembly Score
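Several of the metrics listed above (total length, GC content, number of contigs) are simple to recompute from a contig FASTA file as a sanity check. The Python sketch below is a minimal illustration with hypothetical function names and toy data; it is not QUAST's implementation and ignores details such as minimum contig length cutoffs.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict of {contig_name: sequence}."""
    parts_by_name = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = fields[0] if fields else "contig_%d" % (len(parts_by_name) + 1)
            parts_by_name[name] = []
        elif name is not None:
            parts_by_name[name].append(line.upper())
    return {n: "".join(p) for n, p in parts_by_name.items()}


def assembly_stats(contigs):
    """Return contig count, total length, and GC percentage."""
    seqs = list(contigs.values())
    total = sum(len(s) for s in seqs)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    gc_percent = 100.0 * gc / total if total else 0.0
    return {"num_contigs": len(seqs), "total_length": total, "gc_percent": gc_percent}


if __name__ == "__main__":
    toy_fasta = [">c1 toy contig", "ACGT", "GGCC", ">c2", "AATT"]
    print(assembly_stats(read_fasta(toy_fasta)))
```

For a real assembly, the lines would come from the contig file, for example `open("contigs.fasta")`, where the file name is an assumption.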

20 Feature response curve to evaluate correctness of assembly

21 Pilon didn't improve the assembly much.

22 Post Assembly
- Sort contigs
- Make scaffolds
- Close gaps in scaffolds

23 Scaffolding Tools
- SGA
- SOPRA
- SOAPdenovo2
- SSPACE

24 SGA (String Graph Assembler)
- Most conservative tool
- Generally has the lowest error rate
- Makes fewer joins than other tools

25 SOPRA
Balances making as many joins as possible against keeping a low error rate.

26 SOAPdenovo2
Fastest non-greedy tool. Includes six modules:
- Read error correction
- Graph construction
- Contig assembly
- Read mapping
- Scaffold construction
- Gap closing

27 SSPACE
- Most cited
- Fastest
- Greedy method

28 Gap Closing Tools (Pending Tests)
- IMAGE2
- GapFiller
- FinIS
- SOAPdenovo2 GapCloser
- FGAP
- PILON
- Velvet
*Source:

29 References
- Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics 14(Suppl 7):S / S7-S6.
- Wences AH, Schatz MC. Metassembler: merging and optimizing de novo genome assemblies. Genome Biology 16, 207 (2015).
- Zimin AV, Smith DR, Sutton G, Yorke JA. Assembly reconciliation. Bioinformatics (Oxford, England). 2008, 24: /bioinformatics/btm542.
- Lin S-H, Liao Y-C. CISA: Contig Integrator for Sequence Assembly of Bacterial Genomes. Watson M, ed. PLoS ONE. 2013;8(3):e
- Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics 2013; 29 (21): doi: /bioinformatics/btt476
- Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
- Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 2013; 29 (14): doi: /bioinformatics/btt273