Computational Genomics [2017] Faction 2: Genome Assembly Results, Protocol & Demo

Christian Colon, Erisa Sula, Juichang Lu, Tian Jin, Lijiang Long, Rohini Mopuri, Bowen Yang, Saminda Wijeratne, Harrison Kim

Outline
  Objective
  Initial Workflow
  Pre-Assembly Tools
  Assembler Tools
  Post-Assembly Tools
  Final Workflow
  Result discussion

Objective
  Determine the best method to assemble the Salmonella genomes
  Evaluate and compare the available tools
  Assemble reads and combine results into a super-assembly
  Compare results from different tools and find the best assemblies

Initial Workflow
  Raw Reads
  De Novo: MaSuRCA
  Trim Reads: Trimmomatic, Prinseq, Trim Galore
  De Novo: Velvet, SPAdes, ABySS, SOAPdenovo2
  Mergers: CISA, Metassembler
  Scaffolding/Extensions: SSPACE, SOAPdenovo, SOPRA
  Improvement: Pilon, GapFiller, FGAP
  Reference: bwa mem
  Final Assembly

Trim Reads

Trim Galore!
  Adapter trimming with the first 13 bp of the Illumina standard adapter (--illumina)
  Clip options remove a fixed number of bp from read ends, in addition to adapter and quality trimming (bias removal)
  Length option discards reads shorter than a set INT
  FastQC for read quality assessment
  Usage: $ trim_galore --illumina --clip_R1 17 --clip_R2 17 --three_prime_clip_R1 5 --three_prime_clip_R2 5 --length 100 --paired -o output.dir read1.fq.gz read2.fq.gz
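The effect of the clip and length options above can be illustrated with a minimal Python sketch. It only mimics fixed-length clipping and the length filter; adapter and quality trimming are left out, and the function name `clip_and_filter` is a hypothetical helper, not part of Trim Galore.

```python
def clip_and_filter(seq, qual, clip_5p=17, clip_3p=5, min_length=100):
    """Clip a fixed number of bases from each end of a read.

    Returns (sequence, quality) or None when the clipped read is shorter
    than min_length. Defaults mirror the values in the usage line above.
    """
    # Guard against reads shorter than the 3' clip length.
    end = max(len(seq) - clip_3p, 0)
    clipped_seq = seq[clip_5p:end]
    clipped_qual = qual[clip_5p:end]
    if len(clipped_seq) < min_length:
        return None
    return clipped_seq, clipped_qual


def filter_pair(read1, read2, **kwargs):
    """Keep a read pair only if both mates survive clipping (paired mode)."""
    kept1 = clip_and_filter(*read1, **kwargs)
    kept2 = clip_and_filter(*read2, **kwargs)
    if kept1 is None or kept2 is None:
        return None
    return kept1, kept2


if __name__ == "__main__":
    long_read = ("A" * 150, "I" * 150)   # 150 - 17 - 5 = 128 bp, kept
    short_read = ("A" * 120, "I" * 120)  # 120 - 17 - 5 = 98 bp, discarded
    print(clip_and_filter(*long_read) is not None)
    print(filter_pair(long_read, short_read))
```

With these settings a read must be at least 122 bp long before clipping to survive the 100 bp length filter.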

Assemblers

SPAdes
  Short-read de Bruijn graph assembler; takes single and paired-end reads
  High-level view of SPAdes assembly:
    Assembly graph construction with multi-sized de Bruijn graphs and bulge resolution
    Integration of paired-end data to determine genomic distance
    Contig reconstruction
    Error correction by BayesHammer
  Usage: $ spades.py --pe1-1 <read_one> --pe1-2 <read_two> -t 4 -k <kmer list> -o <output directory>
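To make the de Bruijn graph idea concrete, here is a toy Python sketch with a single k-mer size. It is not how SPAdes is implemented (no error correction, no multi-sized graphs, no bulge removal, no paired-end information); the function names and the example reads are illustrative assumptions.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop on cycles (repeats)
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


if __name__ == "__main__":
    # Three overlapping toy reads sampled from ATGGCGTGCA.
    reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
    graph = build_de_bruijn(reads, k=4)
    print(walk_unambiguous_path(graph, "ATG"))  # ATGGCGTGCA
```

Branching nodes (more than one successor) are where a real assembler needs paired-end data or longer k-mers to decide how to continue.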

MaSuRCA
  Algorithm combines the benefits of de Bruijn graphs with overlap-layout-consensus
  Generates super-reads
  Input reads: raw reads generated from Illumina, no preprocessing
  Usage: $ masurca configure_file.txt
    Generates the assemble.sh file in the current directory
  $ ./assemble.sh
    Creates the actual results
  [Figure: example configuration file]

Velvet
  Manipulates de Bruijn graphs for de novo genome assembly
  Assembly steps:
    Read hashing and graph construction
    Error removal (tips, bubbles, and erroneous connections)
    Repeat resolution
  VelvetOptimiser: a multi-threaded Perl script that automatically optimises the three primary parameter options (K, -exp_cov, -cov_cutoff) for the Velvet de novo sequence assembler
  Usage: $ ./VelvetOptimiser.pl -d out.dir -s <start_kmer> -e <end_kmer> -x <step_size> -f '-shortPaired -<file_type> -separate <read1> <read2>' -t <number of threads> --optFuncKmer 'n50'

SOAPdenovo2
  Short-read de novo assembler that scales up to genomes the size of the human genome
  Uses a de Bruijn graph algorithm
  Compared with SOAPdenovo, SOAPdenovo2 reduces memory consumption in graph construction, resolves more repeats in contig assembly, and increases coverage in scaffolding
  Usage: $ SOAPdenovo-63mer all -s ~/data/config1 -K 63 -R -o graph_prefix
  [Figure: example configuration file]

ABySS
  Usage: $ abyss-pe name=<name> k=<kmer size> in='reads1.fa reads2.fa'

Merger

CISA
  Integrates the assemblies into a hybrid set of contigs. CISA runs in four phases:
    Phase 1: Identification of the representative contigs and possible extensions
    Phase 2: The uncertain regions located at the ends of contigs are clipped
    Phase 3: blastn is performed to merge the contigs iteratively and identify repetitive regions
    Phase 4: blastn is performed with an overlap larger than the maximum size of the repetitive regions
  Usage:
    Merging the assemblies: $ python Merge.py <config>
    Running CISA: $ python CISA.py <config>

Metassembler
  Merges and optimizes de novo genome assemblies
  Ranking the input assemblies by N50 size in descending order usually gives the best super-assembly
  Usage: $ metassemble --conf <conf-file> --outd <output-dir>
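The N50 ranking mentioned above can be sketched in a few lines of Python. The contig lengths and assembly names below are invented purely for illustration and are not results from this project.

```python
def n50(lengths):
    """Return the N50: the contig length at which the running sum of
    lengths (sorted longest first) reaches half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def rank_by_n50(assemblies):
    """Return assembly names ordered by N50, largest first."""
    return sorted(assemblies, key=lambda name: n50(assemblies[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical contig lengths for three assemblies of the same genome.
    assemblies = {
        "velvet": [60, 60, 40, 40],
        "spades": [100, 50, 50],
        "abyss": [90, 80, 30],
    }
    print(rank_by_n50(assemblies))  # ['spades', 'abyss', 'velvet']
```

The resulting order is the order in which the assemblies would be listed as inputs when following the N50 rule of thumb.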

Scaffolding

SSPACE w/o extension
  Uses pre-assembled contigs from a de novo assembler to generate scaffolds
  Estimates the gap size between contigs to construct scaffolds based on their spatial relationship
  Can also be run with extension to improve contigs prior to scaffolding
  Uses BWA to map the reads to the contigs
  The position and orientation of the reads are stored to determine the spatial relationship of the contigs
  Usage: $ ./SSPACE_Standard_v3.0.pl -l library_1.txt -s CISA1.fa -k 5 -a 0.70 -n 15 -z 0 -b SSPACE_Output1 -p 1
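As a rough illustration of how a gap size can be estimated from read pairs that span two contigs, consider the sketch below. It is a simplified model under stated assumptions (forward mate on the left contig, reverse mate on the right contig, a known mean insert size) and is not SSPACE's actual algorithm; the function name and coordinates are hypothetical, and `min_links` only loosely mirrors the `-k 5` option.

```python
from statistics import mean


def estimate_gap(insert_size, left_contig_length, linking_pairs, min_links=5):
    """Estimate the gap between two contigs from linking read pairs.

    Each pair is (start_on_left, end_on_right): the 0-based start of the
    forward mate on the left contig and the end coordinate of the reverse
    mate on the right contig. For one pair:
        insert_size = (left_contig_length - start_on_left) + gap + end_on_right
    Returns None when fewer than min_links pairs support the link.
    """
    if len(linking_pairs) < min_links:
        return None
    gaps = [
        insert_size - (left_contig_length - start) - end
        for start, end in linking_pairs
    ]
    return mean(gaps)


if __name__ == "__main__":
    pairs = [(700, 100), (800, 210), (750, 140), (600, 5), (900, 295)]
    print(estimate_gap(insert_size=500, left_contig_length=1000, linking_pairs=pairs))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap.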

SSPACE w/ extension
  Uses BWA to map our trimmed reads to the contigs and identify the reads that were left unmapped when the contigs were assembled
  Uses these unmapped reads to extend the contigs prior to scaffolding
  If enough unmapped reads agree on the same nucleotide at a position, that nucleotide is added to the sequence
  Usage: $ ./SSPACE_Standard_v3.0.pl -l library_1.txt -s CISA1.fa -x 1 -m 50 -o 20 -r 0.9 -k 5 -a 0.70 -n 15 -z 0 -p 1 -b SSPACE_Output1
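The "enough reads agree on the same nucleotide" rule can be sketched as a simple vote per position. This is an illustration only, assuming the overhanging parts of the reads have already been extracted; the thresholds mirror the `-o 20` (minimum reads) and `-r 0.9` (minimum base ratio) values in the usage line, and the function name is hypothetical.

```python
from collections import Counter


def extend_with_overhangs(contig, overhangs, min_reads=20, min_ratio=0.9):
    """Extend a contig one base at a time from read overhangs.

    A base is appended only if at least min_reads overhangs cover the
    position and the most common base reaches min_ratio of them.
    """
    extension = []
    position = 0
    while True:
        bases = [o[position] for o in overhangs if len(o) > position]
        if len(bases) < min_reads:
            break  # not enough coverage to call a base
        base, count = Counter(bases).most_common(1)[0]
        if count / len(bases) < min_ratio:
            break  # reads disagree too much
        extension.append(base)
        position += 1
    return contig + "".join(extension)


if __name__ == "__main__":
    # 22 reads support 'A'; 19 of 20 reads support 'C' at the next position.
    overhangs = ["AC"] * 19 + ["AT"] + ["A"] * 2
    print(extend_with_overhangs("GGG", overhangs))  # GGGAC
```

Extension stops at the first position where either coverage or agreement drops below the thresholds.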

SOAPdenovo2 (scaffolding)
  SOAPdenovo is a short-read assembly method that can build a de novo draft assembly for human-sized genomes
  We could not isolate the scaffolding tool within SOAPdenovo so that it could be used with other assemblies (it is specific to SOAPdenovo assemblies)
  We ran it only on the SOAPdenovo contigs

Improvement

GapFiller
  GapFiller is a stand-alone program for closing gaps within pre-assembled scaffolds
  Input: pre-assembled scaffold sequences (FASTA) and NGS paired-read data (FASTA or FASTQ)
  Output: the final gap-filled scaffolds in FASTA format
  Gaps are iteratively filled from the left and right edges by incorporating one overhang nucleotide at a time, provided the position is sufficiently covered
  Usage: $ perl GapFiller.pl -l <library.txt> -s <genome.fasta>
    <library.txt>: <libname> <forward_fq> <reverse_fq> <insert_size> <standard_dev> FR

FGAP
  Closes gaps using sequences from alternative assemblies or other alternative data, selecting the sequences best suited to each gap
  Depends on MATLAB and BLAST to work out candidate sequences
  We used the trimmed reads from the pre-assembly step as the alternative data
  Usage: $ run_fgap.sh <Matlab-libs> -d <genome.fasta> -a <fasta-dataset> -b <blast-libs>
    <fasta-dataset>: <dataset1.fasta>,<dataset2.fasta>,...,<datasetn.fasta>

Pilon
  Pilon is a software tool which can be used to:
    Automatically improve draft assemblies
    Find variation among strains, including large event detection
  Requirement: a FASTA file of the genome along with one or more BAM files of reads aligned to that FASTA file
  Pilon uses read alignment analysis to identify inconsistencies between the input genome and the evidence in the reads
  Usage: $ java -Xmx15G -jar pilon-1.16.jar --genome <genome.fasta> --frags <mapping.bam> --variant

All-in-one Tool

Unicycler
  Integrates SPAdes, Bowtie2, SAMtools, BLAST+, and Pilon
  Takes paired-end reads and, optionally, long reads to perform hybrid assembly
  Uses the assembly graph to do scaffolding
  Usage: $ unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads_optional.fq.gz -o out.dir

Reference-based assembly

Pipeline for reference-based assembly
#!/bin/bash
# reference_base_assembly_pipeline
# Assumes the reference has already been indexed with `bwa index`.
sample_prefix=sp0001
read1="${sample_prefix}_r1_val_1.fq.gz"
read2="${sample_prefix}_r2_val_2.fq.gz"
fasta_file=1045684451.fasta

# bwa mapping
bwa mem "$fasta_file" "$read1" "$read2" > "${sample_prefix}.sam"

# samtools sort
samtools sort -O bam -T temp1 "${sample_prefix}.sam" > "${sample_prefix}.bam"

# samtools index
samtools index "${sample_prefix}.bam"

# samtools mpileup, piped into bcftools call
samtools mpileup -f "$fasta_file" -gu "${sample_prefix}.bam" | bcftools call -c -O b -o "${sample_prefix}.raw.bcf"

# convert file to fastq format
bcftools view -O v "${sample_prefix}.raw.bcf" | vcfutils.pl vcf2fq > "${sample_prefix}.fastq"

# convert fastq to fasta
python3 convert_fastq_to_fasta.py -q "${sample_prefix}.fastq" -a "${sample_prefix}.fasta"
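The slides do not show `convert_fastq_to_fasta.py`, so the following is only a guess at what such a conversion might look like, not the authors' script. It assumes the consensus FASTQ may wrap sequence and quality over several lines (as `vcfutils.pl vcf2fq` output does), and the function name and `line_width` parameter are hypothetical.

```python
def fastq_to_fasta(fastq_text, line_width=60):
    """Convert FASTQ text (possibly multi-line records) to FASTA text."""
    lines = fastq_text.splitlines()
    records = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith("@"):
            i += 1  # skip blank or unexpected lines between records
            continue
        name = lines[i][1:].strip()
        i += 1
        # Sequence lines run until the '+' separator.
        seq_parts = []
        while i < len(lines) and not lines[i].startswith("+"):
            seq_parts.append(lines[i].strip())
            i += 1
        sequence = "".join(seq_parts)
        i += 1  # skip the '+' line
        # Quality has the same total length as the sequence; counting
        # characters avoids mistaking a quality line starting with '@'
        # for a new record header.
        quality_length = 0
        while i < len(lines) and quality_length < len(sequence):
            quality_length += len(lines[i].strip())
            i += 1
        records.append((name, sequence))

    out = []
    for name, sequence in records:
        out.append(">" + name)
        for start in range(0, len(sequence), line_width):
            out.append(sequence[start:start + line_width])
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    example = "@chr1\nACGT\nNNAC\n+\n!!!!\n@@II\n"
    print(fastq_to_fasta(example), end="")
```

Lower-case or `n` bases in the consensus are passed through unchanged; masking or filtering by quality would be a separate step.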

The mapping coverage map generated with BRIG shows a small region with no coverage in some samples

A detailed look at the region with no reads

The region with no read mapping appears as a deletion relative to the reference
  Region: 375,500-414,700 (around 39 kb)
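A region like this can also be located programmatically from per-base read depth. The sketch below assumes the depths are available as a plain list with one value per reference position (for example, parsed from `samtools depth -a` output); the function name and the toy depth values are illustrative, not project data.

```python
def zero_coverage_regions(depths, min_length=1):
    """Return 1-based inclusive (start, end) intervals with zero read depth."""
    regions = []
    start = None
    for position, depth in enumerate(depths, start=1):
        if depth == 0 and start is None:
            start = position  # a zero-depth run begins
        elif depth != 0 and start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a run that extends to the end of the reference.
    if start is not None and len(depths) - start + 1 >= min_length:
        regions.append((start, len(depths)))
    return regions


if __name__ == "__main__":
    toy_depths = [5, 4, 0, 0, 0, 7, 0, 3]
    print(zero_coverage_regions(toy_depths))                # [(3, 5), (7, 7)]
    print(zero_coverage_regions(toy_depths, min_length=2))  # [(3, 5)]

    # Size of the region reported on the slide.
    print(414_700 - 375_500)  # 39200 bp, i.e. about 39 kb
```

Setting `min_length` to a few hundred bases would filter out short dropouts and leave only large candidate deletions.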

De novo assembly supports a transposon-like structure
  [Figure labels: de novo assembly; reference; backbone; large insertion (~39 kb); repetitive region (46 bp)]

Caveats in reference-based assembly
  1. Inversion between Genome_1 and Genome_2
  2. Insertion between Genome_1 and Genome_2

Alignment of de novo assembly with reference shows no inversion or insertion

Pre Assembly Results

De novo Assembly Results

Use of QUAST
  Reference genome: -R <fasta file>
  Genome annotation file: -G <gff, gtf, bed>
  Scaffold splitting: -s

Selection of Assembly Score

Performance of Different Assemblers

Performance of Post-Assembly Tools

Performance of Unicycler

Performance of Pilon

Large Deletion or Insertion? Possible.

Final Workflow
  Raw Reads
  De Novo: MaSuRCA
  Trim Reads: Trim Galore
  De Novo: Velvet, SPAdes, ABySS
  Mergers: Metassembler
  Improvement: Pilon
  De Novo: Unicycler
  Reference: BWA mem
  Final Assembly

References
  Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics. 2013;14(Suppl 7):S6. doi: 10.1186/1471-2105-14-S7-S6
  Wences AH, Schatz MC. Metassembler: merging and optimizing de novo genome assemblies. Genome Biology. 2015;16:207.
  Zimin AV, Smith DR, Sutton G, Yorke JA. Assembly reconciliation. Bioinformatics. 2008;24:42-5. doi: 10.1093/bioinformatics/btm542
  Lin S-H, Liao Y-C. CISA: Contig Integrator for Sequence Assembly of Bacterial Genomes. Watson M, ed. PLoS ONE. 2013;8(3):e60843.
  Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29(21):2669-2677. doi: 10.1093/bioinformatics/btt476
  Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
  Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013;29(14):1718-1725. doi: 10.1093/bioinformatics/btt273