Computational Genomics [2017] Faction 2: Genome Assembly Results, Protocol & Demo

Christian Colon, Erisa Sula, Juichang Lu, Tian Jin, Lijiang Long, Rohini Mopuri, Bowen Yang, Saminda Wijeratne, Harrison Kim

Outline
- Objective
- Initial Workflow
- Pre-Assembly Tools
- Assembler Tools
- Post-Assembly Tools
- Final Workflow
- Result discussion

Objective
- Determine the best method to assemble the Salmonella genomes
- Evaluate and compare the available tools
- Assemble reads and combine results into a super-assembly
- Compare results from different tools and find the best assemblies

Initial Workflow
- De Novo: MaSuRCA
- Raw Reads
- Trim Reads: Trimmomatic, Prinseq, Trim Galore
- De Novo: Velvet, SPAdes, ABySS, SOAPdenovo2
- Mergers: CISA, Metassembler
- Scaffolding/Extensions: SSPACE, SOAPdenovo, SOPRA
- Improvement: Pilon, GapFiller, FGAP
- Reference: BWA mem
- Final Assembly

Trim Reads

Trim Galore!
- Adapter trimming (13bp Illumina default) (--illumina)
- Clip options for bp removal prior to actual trimming (bias removal)
- Length option to discard reads shorter than a set INT amount
- FastQC for read quality assessment
Usage:
  $ trim_galore --illumina --clip_R1 17 --clip_R2 17 --three_prime_clip_R1 5 --three_prime_clip_R2 5 --length 100 --paired read1.fq.gz read2.fq.gz -o output.dir
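The clip and length options in the usage line can be pictured with a minimal Python sketch. It is not Trim Galore's implementation: adapter and quality trimming are left out, the function name `clip_and_filter` is hypothetical, and the pair-level filter assumes the default paired-end behaviour in which a pair is dropped when either mate ends up shorter than the length cutoff.

```python
def clip_and_filter(read1, read2, clip5=17, clip3=5, min_len=100):
    """Clip fixed numbers of bases from both ends of each mate, then apply
    a pair-level length filter.

    Returns the clipped (read1, read2) tuple, or None if the pair is discarded.
    Adapter and quality trimming are intentionally not modelled here.
    """
    def clip(seq):
        end = len(seq) - clip3
        # If clipping would consume the whole read, nothing is left.
        return seq[clip5:end] if end > clip5 else ""

    clipped1, clipped2 = clip(read1), clip(read2)
    # Assumption: the pair is dropped if either mate is too short.
    if len(clipped1) < min_len or len(clipped2) < min_len:
        return None
    return clipped1, clipped2
```

With the values from the usage line, a 122 bp read keeps its middle 100 bp, while a 121 bp mate would cause the whole pair to be discarded.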

Assemblers

SPAdes
- Short-read de Bruijn graph assembler; takes single and paired-end reads
- High-level view of SPAdes assembly:
  - Read error correction by BayesHammer (runs before graph construction)
  - Assembly graph construction with multi-sized de Bruijn graphs and bulge resolution
  - Integration of paired-end data to determine genomic distance
  - Contig reconstruction
Usage:
  $ spades.py --pe1-1 <read_one> --pe1-2 <read_two> -t 4 -k <kmer list> -o <output directory>
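As a rough illustration of what a de Bruijn graph over several k-mer sizes means, here is a minimal Python sketch. It only builds one plain graph per k from error-free reads; it does not reproduce how SPAdes combines the graphs, resolves bulges, or uses paired-end data, and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer
    occurrence in a read adds one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def multi_k_graphs(reads, kmer_list):
    """Build one independent graph per k-mer size in kmer_list
    (the list passed with -k in the usage line)."""
    return {k: de_bruijn_edges(reads, k) for k in kmer_list}
```

Small k values give more connected but more tangled graphs, large k values give simpler but more fragmented ones, which is the motivation for trying several sizes.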

MaSuRCA
- Algorithm combines the benefits of de Bruijn graphs with overlap-layout-consensus
- Generates super-reads
- Input reads: raw reads generated from Illumina, no preprocessing
Usage:
  $ masurca configure_file.txt    (generates an assemble.sh file in the current directory)
  $ ./assemble.sh                 (creates the actual results)
[Figure: example configuration file]

Velvet
- Manipulates de Bruijn graphs for de novo genome assembly
- Assembly steps:
  - Read hashing and graph construction
  - Error removal (tips, bubbles, and erroneous connections)
  - Repeat resolution
- VelvetOptimiser: a multi-threaded Perl script for automatically optimising the three primary parameter options (K, -exp_cov, -cov_cutoff) for the Velvet de novo sequence assembler
Usage:
  $ ./VelvetOptimiser.pl -d out.dir -s start_kmer -e end_kmer -x step_size -f '-<file_type> -shortPaired -separate read1.<file_type> read2.<file_type>' -t <number of threads> --optFuncKmer 'n50'
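Choosing the k-mer size by N50, as requested in the usage line, can be sketched in a few lines of Python. This is only an illustration of the selection criterion, not VelvetOptimiser's code: the contig lengths per k are assumed to be available already, the function names are hypothetical, and breaking ties in favour of the larger k is our own assumption.

```python
def n50(contig_lengths):
    """Length L of the contig at which the cumulative length of contigs,
    sorted from longest to shortest, first reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def best_kmer(lengths_by_k):
    """Return the k whose assembly has the highest N50.
    Assumption: ties are broken in favour of the larger k."""
    return max(lengths_by_k, key=lambda k: (n50(lengths_by_k[k]), k))
```

For example, `best_kmer({31: [10, 10, 10], 41: [25, 5], 51: [15, 15]})` selects k = 41, because that assembly has the largest N50 of the three.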

SOAPdenovo2
- Short-read de novo assembler capable of working with genomes up to the size of the human genome
- Employs a de Bruijn graph algorithm
- SOAPdenovo2 reduces memory consumption in the graph construction step, resolves more repeats in contig assembly, and increases coverage in scaffolding
Usage:
  $ SOAPdenovo-63mer all -s ~/data/config1 -K 63 -R -o graph_prefix
[Figure: example configuration file]

ABySS
Usage:
  $ abyss-pe name=<name> k=<kmer size> in='reads1.fa reads2.fa'
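The file passed with -s describes the read libraries. The Python sketch below writes a minimal single-library configuration using keys from the SOAPdenovo2 documentation. Every value (read length, insert size, file names) is a hypothetical placeholder and not the configuration used for these assemblies; the helper name `soap_config` is also our own.

```python
def soap_config(q1, q2, max_rd_len=100, avg_ins=300):
    """Render a minimal one-library SOAPdenovo2 config as text.
    All default values are placeholders for illustration only."""
    lines = [
        f"max_rd_len={max_rd_len}",  # reads longer than this are cut
        "[LIB]",                     # start of one library section
        f"avg_ins={avg_ins}",        # average insert size of the library
        "reverse_seq=0",             # 0 = forward-reverse paired-end library
        "asm_flags=3",               # 3 = use reads for contigs and scaffolds
        "rank=1",                    # order in which libraries are used for scaffolding
        f"q1={q1}",                  # FASTQ file with read 1
        f"q2={q2}",                  # FASTQ file with read 2
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(soap_config("read1.fq", "read2.fq"), end="")
```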

Merger

CISA
- Integrates the assemblies into a hybrid set of contigs
- CISA runs in four phases:
  - Phase 1: Identification of the representative contigs and possible extensions
  - Phase 2: The uncertain regions located at the ends of contigs are clipped
  - Phase 3: blastn is performed to merge the contigs iteratively and identify repetitive regions
  - Phase 4: Contigs are merged with blastn when their overlap is larger than the maximum size of the repetitive regions
Usage:
  Merging the input assemblies: $ python Merge.py <config>
  Running CISA: $ python CISA.py <config>

Metassembler
- Merges and optimizes de novo genome assemblies
- Ranking the input assemblies by N50 size in descending order usually gives the best super-assembly
Usage:
  $ metassemble --conf <conf-file> --outd <output-dir>

Scaffolding

SSPACE w/o extension
- Uses pre-assembled contigs from a de novo assembler to generate scaffolds
- Estimates the gap size between contigs to construct scaffolds based on their spatial relationship
- Can also be run with extension to improve contigs prior to scaffolding
- Uses BWA to map the reads to the contigs
- The position and orientation of the reads are stored to determine the spatial relationship of the contigs
Usage:
  $ ./SSPACE_Standard_v3.0.pl -l library_1.txt -s CISA1.fa -k 5 -a 0.70 -n 15 -z 0 -b SSPACE_Output1 -p 1
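The geometry behind estimating a gap from read pairs can be sketched in Python as follows. This is a generic illustration, not SSPACE's actual estimator: it assumes a forward-reverse library with a known insert size, that read 1 maps forward on the left contig and read 2 maps in reverse on the right contig, and that -k in the usage line is the minimum number of linking pairs. The function name is hypothetical.

```python
from statistics import median


def estimate_gap(insert_size, len_left, pairs, min_links=5):
    """Estimate the gap between two contigs from linking read pairs.

    pairs: list of (start_on_left, end_on_right) tuples, where
      start_on_left is the 0-based start of read 1 on the left contig and
      end_on_right is the 0-based exclusive end of read 2 on the right contig.
    Each pair spans: (len_left - start_on_left) + gap + end_on_right = insert_size.
    Returns the median gap estimate, or None if there are too few links.
    """
    if len(pairs) < min_links:
        return None
    estimates = [
        insert_size - (len_left - start_left) - end_right
        for start_left, end_right in pairs
    ]
    return median(estimates)
```

A negative result would suggest that the two contigs overlap rather than being separated by a gap.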

SSPACE w/ extension
- Uses BWA to map our trimmed reads to the contigs to determine which reads were left unmapped in the assembly of the contigs
- Uses these unmapped reads to extend the contigs prior to scaffolding
- If enough of the unmapped reads contain the same nucleotide at a position, that nucleotide is added to the sequence
Usage:
  $ ./SSPACE_Standard_v3.0.pl -l library_1.txt -s CISA1.fa -x 1 -m 50 -o 20 -r 0.9 -k 5 -a 0.70 -n 15 -z 0 -p 1 -b SSPACE_Output1

SOAPdenovo2
- SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes
- We couldn't figure out how to isolate the scaffolding tool within SOAPdenovo so that it could be used with other assemblies (it is specific to SOAPdenovo assemblies)
- Ran it only on the SOAPdenovo contigs
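The extension rule described above can be sketched in Python. This is a simplified illustration rather than SSPACE's code: it assumes the overhanging bases of the reads have already been collected, that -o 20 is the minimum number of reads needed to call a base, and that -r 0.9 is the minimum fraction of reads that must agree. The function name is hypothetical.

```python
from collections import Counter


def extend_contig(contig, overhangs, min_reads=20, min_ratio=0.9):
    """Extend the 3' end of a contig one base at a time.

    overhangs: list of strings, each holding the bases by which one read
    extends past the current end of the contig.
    A base is added only while enough reads cover the position and a
    sufficient fraction of them agree on the same nucleotide.
    """
    extension = []
    position = 0
    while True:
        bases = [o[position] for o in overhangs if len(o) > position]
        if len(bases) < min_reads:
            break  # not enough coverage to call a base
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) < min_ratio:
            break  # reads disagree too much at this position
        extension.append(base)
        position += 1
    return contig + "".join(extension)
```

The real tool re-maps reads as the contig grows; the sketch uses a fixed set of overhangs to keep the rule itself visible.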

Improvement

GapFiller
- GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds
- The input data is given by pre-assembled scaffold sequences (FASTA) and NGS paired-read data (FASTA or FASTQ)
- The final gap-filled scaffolds are provided in FASTA format
- Gaps are iteratively filled from the left and right edge by incorporating one overhang nucleotide at a time, provided the position is sufficiently covered
Usage:
  $ perl GapFiller.pl -l <library.txt> -s <genome.fasta>
  (<library.txt>: <libname> <forward_fq> <reverse_fq> <insert_size> <standard_dev> FR)

FGAP
- FGAP closes gaps by deriving the best-suited sequences from alternative assemblies or other alternative data
- The tool depends on MATLAB and BLAST to work out potential sequences
- We used the trimmed reads from the pre-assembly step as alternative data for the tool
Usage:
  $ run_fgap.sh <Matlab-libs> -d <genome.fasta> -a <fasta-dataset> -b <blast-libs>
  (<fasta-dataset>: <dataset1.fasta>,<dataset2.fasta>,...,<datasetn.fasta>)

PILON
- Pilon is a software tool which can be used to:
  - Automatically improve draft assemblies
  - Find variation among strains, including large event detection
- Required input: a FASTA file of the genome along with one or more BAM files of reads aligned to the input FASTA file
- Pilon uses read alignment analysis to identify inconsistencies between the input genome and the evidence in the reads
Usage:
  $ java -Xmx15G -jar pilon-1.16.jar --genome <genome.fasta> --frags <mapping.bam> --variant
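Because Pilon needs reads aligned to the draft it is polishing, the BAM file has to be produced first. The Python sketch below only assembles the shell commands for one plausible way to do that (BWA-MEM plus samtools); the file names, the `.sorted.bam` and `.pilon` suffixes, and the function name are assumptions, and the slide does not state which mapper produced the BAM for Pilon.

```python
import shlex


def pilon_commands(genome, read1, read2, prefix,
                   jar="pilon-1.16.jar", heap="15G"):
    """Return the shell commands for mapping reads to a draft assembly
    and polishing it with Pilon. Nothing is executed here."""
    genome, read1, read2 = (shlex.quote(p) for p in (genome, read1, read2))
    bam = shlex.quote(f"{prefix}.sorted.bam")
    out = shlex.quote(f"{prefix}.pilon")
    return [
        # Index the draft assembly so that bwa mem can use it.
        f"bwa index {genome}",
        # Map the paired reads and write a coordinate-sorted BAM.
        f"bwa mem {genome} {read1} {read2} | samtools sort -o {bam} -",
        # Pilon requires an indexed BAM.
        f"samtools index {bam}",
        # Polish the draft using the aligned reads as evidence.
        f"java -Xmx{heap} -jar {shlex.quote(jar)} "
        f"--genome {genome} --frags {bam} --output {out}",
    ]


if __name__ == "__main__":
    for command in pilon_commands("draft.fasta", "r1.fq.gz", "r2.fq.gz", "sp0001"):
        print(command)
```

Each string could be passed to `subprocess.run(command, shell=True, check=True)` once the tools are installed.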

All-in-one Tool

Unicycler
- Integrates SPAdes, Bowtie2, SAMtools, BLAST+, and Pilon
- Takes paired-end reads and, optionally, long reads to perform hybrid assembly
- Uses the assembly graph for scaffolding
Usage:
  $ unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads_optional.fq.gz -o out.dir

Reference-based assembly

Pipeline for reference-based assembly

    #!/bin/bash
    # reference_base_assembly_pipeline
    sample_prefix=sp0001
    read1="${sample_prefix}_r1_val_1.fq.gz"
    read2="${sample_prefix}_r2_val_2.fq.gz"
    fasta_file=1045684451.fasta

    # bwa mapping (the reference must already be indexed with bwa index)
    bwa mem "$fasta_file" "$read1" "$read2" > "${sample_prefix}.sam"
    # samtools sort
    samtools sort -O bam -T temp1 "${sample_prefix}.sam" > "${sample_prefix}.bam"
    # samtools index
    samtools index "${sample_prefix}.bam"
    # samtools mpileup, piped into bcftools call
    samtools mpileup -f "$fasta_file" -gu "${sample_prefix}.bam" | bcftools call -c -O b -o "${sample_prefix}.raw.bcf"
    # convert file to fastq format
    bcftools view -O v "${sample_prefix}.raw.bcf" | vcfutils.pl vcf2fq > "${sample_prefix}.fastq"
    # convert fastq to fasta
    python3 convert_fastq_to_fasta.py -q "${sample_prefix}.fastq" -a "${sample_prefix}.fasta"
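The last step of the pipeline calls convert_fastq_to_fasta.py, which is not shown on the slide. The Python sketch below is a hypothetical stand-in for its core logic, not the script the team used. It assumes the consensus FASTQ may wrap sequence and quality over several lines, as vcf2fq output does, so it reads quality lines until their total length matches the sequence length instead of assuming four lines per record.

```python
def fastq_to_fasta(fastq_text, width=60):
    """Convert (possibly multi-line) FASTQ text to FASTA text.
    Quality values are dropped; the case of the bases is kept as-is."""
    lines = fastq_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            i += 1  # skip blank or unexpected lines between records
            continue
        name = lines[i][1:]
        i += 1
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        while i < len(lines) and not lines[i].startswith("+"):
            seq_parts.append(lines[i].strip())
            i += 1
        seq = "".join(seq_parts)
        i += 1  # skip the '+' separator
        # Quality lines may start with '@', so consume them by length.
        qual_len = 0
        while i < len(lines) and qual_len < len(seq):
            qual_len += len(lines[i].strip())
            i += 1
        out.append(f">{name}")
        out.extend(seq[j:j + width] for j in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

Reading the quality block by length is the design choice that matters here: a quality line can legitimately begin with "@", which would confuse a parser that treats every such line as a new record.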

A mapping coverage map generated with BRIG shows a small region with no coverage in some samples

A detailed look at the region with no reads

The region with no read mapping is a deletion relative to the reference
Region: 375,500-414,700 (around 39 kb)

De novo assembly supports a transposon-like structure
[Figure labels: De novo assembly, Reference, Backbone, Large insertion (~39 kb), Repetitive region (46 bp)]

Caveats in reference-based assembly
[Figure: two schematic comparisons of Genome_1 and Genome_2: (1) an inversion, (2) an insertion]
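A region like this can also be located directly from per-base depth values. The Python sketch below is a minimal illustration under the assumption that a depth value is available for every reference position (for example from samtools depth -a); the function name and the minimum-length threshold are hypothetical and were not part of the original analysis.

```python
def zero_coverage_regions(depths, min_length=1000):
    """Find runs of zero read depth.

    depths: per-base depth, where index 0 corresponds to reference position 1.
    Returns a list of (start, end) tuples, 1-based and inclusive, for every
    run of zero depth that is at least min_length bases long.
    """
    regions = []
    start = None
    for pos, depth in enumerate(depths, start=1):
        if depth == 0:
            if start is None:
                start = pos  # a new zero-depth run begins here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None
    # Handle a run that extends to the end of the reference.
    if start is not None and len(depths) - start + 1 >= min_length:
        regions.append((start, len(depths)))
    return regions
```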

Alignment of de novo assembly with reference shows no inversion or insertion

Pre Assembly Results

De novo Assembly Results

Use of QUAST
- Reference genome: -R <fasta file>
- Genome annotation file: -G <gff, gtf, bed>
- Scaffold splitting: -s

Selection of Assembly Score

Performance of Different Assemblers

Performance of Post-Assembly Tools

Performance of Unicycler

Performance of Pilon

Large Deletion or Insertion? Possible.

Final Workflow
- De Novo: MaSuRCA
- Raw Reads
- Trim Reads: Trim Galore
- De Novo: Velvet, SPAdes, ABySS
- Mergers: Metassembler
- Improvement: Pilon
- De Novo: Unicycler
- Reference: BWA mem
- Final Assembly

References
- Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics. 2013;14(Suppl 7):S6. doi: 10.1186/1471-2105-14-S7-S6
- Wences AH, Schatz MC. Metassembler: merging and optimizing de novo genome assemblies. Genome Biology. 2015;16:207.
- Zimin AV, Smith DR, Sutton G, Yorke JA. Assembly reconciliation. Bioinformatics. 2008;24:42-45. doi: 10.1093/bioinformatics/btm542
- Lin S-H, Liao Y-C. CISA: Contig Integrator for Sequence Assembly of Bacterial Genomes. PLoS ONE. 2013;8(3):e60843.
- Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29(21):2669-2677. doi: 10.1093/bioinformatics/btt476
- Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
- Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013;29(14):1718-1725. doi: 10.1093/bioinformatics/btt273