Workflow of de novo assembly

  Experimental design
  Clean the sequencing data (trim adapters and low-quality sequence)
  Run assembly software for contiging and scaffolding
  Evaluate the assembly
  Several iterations: adjust settings and software, or add more reads, to improve the assembly

Experimental design using the Illumina platform
  Estimated genome size: 500 Mb
  Platform: Illumina HiSeq
  Paired-end library: 150 bp x 2; 2 lanes; ~100x coverage
  Mate-pair libraries: three libraries (5 kb, 10 kb, 15 kb); 1 lane
  Software: SOAPdenovo2, DISCOVAR, MaSuRCA (hybrid, computationally demanding)
  Large-memory computer: BioHPC Lab large-memory servers, 512 GB to 1 TB RAM
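As a rough check on the design above, expected depth is total sequenced bases divided by genome size. The sketch below uses the slide's figures (500 Mb genome, 150 bp x 2 reads, ~100x target); the function names are illustrative and not part of any tool mentioned here.

```python
def coverage(read_pairs, read_len, genome_size):
    """Expected depth from paired-end reads: total sequenced bases / genome size."""
    return read_pairs * 2 * read_len / genome_size


def pairs_needed(target_cov, read_len, genome_size):
    """Read pairs required to reach a target depth (rounded up)."""
    bases_needed = target_cov * genome_size
    bases_per_pair = 2 * read_len
    return -(-bases_needed // bases_per_pair)  # ceiling division on integers


genome_size = 500_000_000   # 500 Mb estimate from the design above
read_len = 150              # 150 bp x 2 paired-end reads

n_pairs = pairs_needed(100, read_len, genome_size)
print(n_pairs)                                   # read pairs for ~100x
print(coverage(n_pairs, read_len, genome_size))  # depth actually achieved
```

The same two functions can be reused for the DISCOVAR-style design by changing the read length and target depth.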

DISCOVAR approach
  Library fragment size: ~450 bp
  Read size: >= 250 bp x 2
  Depth: about 60x

Data cleaning with Trimmomatic
  Trim low-quality data (quality-score-based trimming)
  Clip sequencing adapters (by alignment to the adapter sequence)

    java -jar /programs/trimmomatic/trimmomatic-0.32.jar PE -phred33 \
      SRR1554178_1.fastq SRR1554178_2.fastq \
      r1.fastq u1.fastq r2.fastq u2.fastq \
      ILLUMINACLIP:/programs/trimmomatic/adapters/TruSeq3-PE-2.fa:2:30:10 \
      LEADING:10 \
      TRAILING:10 \
      SLIDINGWINDOW:4:15 \
      MINLEN:50

  MINLEN:50 sets the minimum read length to keep.

[Figure: Trimmomatic in paired-end, palindrome clip mode. Input: R1.fastq, R2.fastq. Output: Paired1, Unpaired1, Paired2, Unpaired2.]
MINLEN:50 test (values as shown on the slide; unit not stated):
  r1.fastq  417503786
  r2.fastq  411903328
  u1.fastq  62712666
  u2.fastq  2776034
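To make the quality-based trimming step more concrete, here is a simplified Python sketch of a sliding-window cut followed by a minimum-length filter, using the same parameters as SLIDINGWINDOW:4:15 and MINLEN:50. It is an illustration of the idea only and does not claim to reproduce Trimmomatic's exact behaviour; all function names are hypothetical.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string with Phred+33 encoding into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_cut(scores, window=4, min_avg=15):
    """Return the number of leading bases to keep.

    Scan windows from the 5' end; at the first window whose mean quality
    is below min_avg, cut the read at the start of that window.
    """
    for start in range(0, len(scores) - window + 1):
        win = scores[start:start + window]
        if sum(win) / window < min_avg:
            return start
    return len(scores)


def trim_read(seq, qual, window=4, min_avg=15, min_len=50):
    """Apply the window cut, then drop the read (None) if shorter than min_len."""
    keep = sliding_window_cut(phred33_scores(qual), window, min_avg)
    if keep < min_len:
        return None
    return seq[:keep], qual[:keep]


# A 60-base read whose last five bases have very low quality ('#' = Q2).
print(trim_read("A" * 60, "I" * 55 + "#" * 5))
```

In a paired-end run, a read whose mate is dropped this way is what ends up in the unpaired output files (u1.fastq, u2.fastq above).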

Running assembly software
  Test different k-mer sizes and different assembly software.
  This is not always possible, because assembling a large genome takes a very long time, even on a large-memory computer.

Test different k-mer sizes: SOAPdenovo2 on BioHPC Lab
  Two binaries are provided, with a maximum k-mer size of 63 or 127:

    /programs/soapdenovo2/soapdenovo-63mer [options]
    /programs/soapdenovo2/soapdenovo-127mer [options]

  Example runs with two k-mer sizes:

    SOAPdenovo-127mer all -s config.txt -K 101 -R -o assembly
    SOAPdenovo-127mer all -s config.txt -K 127 -R -o assembly

SOAPdenovo config file

    #maximal read length
    max_rd_len=101
    [LIB]
    #average insert size
    avg_ins=300
    #if sequence needs to be reversed
    reverse_seq=0
    #in which part(s) the reads are used
    asm_flags=3
    #in which order the reads are used while scaffolding
    rank=1
    #cutoff of pair number for a reliable connection (at least 3 for short insert size)
    pair_num_cutoff=3
    #minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
    map_len=32
    #a pair of fastq files; the read 1 file should always be followed by the read 2 file
    q1=r1.fastq
    q2=r2.fastq

  Run the assembly with this config file:

    /programs/soapdenovo2/soapdenovo-127mer all -s config.txt -K 127 -R -o assembly

Multiple libraries can be mixed in one assembly

    [LIB]
    avg_ins=450
    reverse_seq=0
    asm_flags=3
    q1=r1.fastq
    q2=r2.fastq
    [LIB]
    asm_flags=1
    q=u1.fastq
    [LIB]
    avg_ins=2000
    reverse_seq=1
    asm_flags=3
    q1=r1.fastq
    q2=r2.fastq

  The library highlighted on the slide is the one with avg_ins=2000 and reverse_seq=1.
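When several libraries are combined, it can be convenient to generate the config file from a list of library descriptions rather than editing it by hand. The following Python sketch is one possible way to do that; the helper names are hypothetical, and the keys and values are the ones shown on the slides (max_rd_len is taken from the single-library config).

```python
def lib_section(**fields):
    """Format one [LIB] section from key=value fields, in the order given."""
    lines = ["[LIB]"]
    lines += [f"{key}={value}" for key, value in fields.items()]
    return "\n".join(lines)


def build_config(max_rd_len, libraries):
    """Join the global max_rd_len line and one [LIB] section per library."""
    parts = [f"max_rd_len={max_rd_len}"]
    parts += [lib_section(**lib) for lib in libraries]
    return "\n".join(parts) + "\n"


# Library settings copied from the slide above.
libraries = [
    dict(avg_ins=450, reverse_seq=0, asm_flags=3, q1="r1.fastq", q2="r2.fastq"),
    dict(asm_flags=1, q="u1.fastq"),
    dict(avg_ins=2000, reverse_seq=1, asm_flags=3, q1="r1.fastq", q2="r2.fastq"),
]
config_text = build_config(101, libraries)
print(config_text)
```

Writing config_text to config.txt would give a file usable with the -s option of the commands shown earlier.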

Multi-Kmer Approach in SOAPdenovo2
  Start building the graph with a small k-mer size.
  Iteratively rebuild the graph by mapping larger k-mers to the previous graph.

    //Start
    k <- k_min (k_min is set at the graph construction "pregraph" step);
    Construct initial de Bruijn graph with k_min;
    Remove low depth k-mers and cut tips;
    Merge bubbles of the de Bruijn graph;
    Repeat {
        k <- k + 1;
        Get contig graph H_k from previous loop or construct from de Bruijn graph;
        Map reads to H_k and remove the reads already represented in the graph;
        Construct H_(k+1) graph based on H_k graph and the remaining reads with k;
        Remove low depth edges and weak edges in H_k;
    } Stop if k >= k_max (k_max = k set in contig step (-m));
    Cut tips and merge bubbles;
    Output all contigs;
    //End
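The first steps of the procedure above (build a de Bruijn graph for one k, then drop low-depth k-mers) can be illustrated with a toy Python sketch. This is not SOAPdenovo2's implementation: it ignores reverse complements, tips, and bubbles, and the function names and min_depth parameter are assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(reads, k, min_depth=1):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each kept k-mer is an edge.

    k-mers seen fewer than min_depth times are dropped, a crude stand-in
    for the "remove low depth k-mers" step.
    """
    graph = defaultdict(set)
    for kmer, depth in count_kmers(reads, k).items():
        if depth >= min_depth:
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reads = ["ACGTA", "CGTAC", "ACGTT"]
print(dict(build_de_bruijn(reads, k=3)))               # all k-mers kept
print(dict(build_de_bruijn(reads, k=3, min_depth=2)))  # low-depth k-mers removed
```

With min_depth=2 the branch at node GT disappears, which shows in miniature why depth filtering simplifies the graph before contigs are read out.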

Multi-Kmer Approach in SOAPdenovo2

    SOAPdenovo-127mer all -s config.txt -K 95 -m 127 -R -o assembly

  -K: starting k-mer size
  -m: ending k-mer size

Run DISCOVAR on BioHPC Lab
  Use Picard to convert the fastq.gz files to a BAM file:

    export JAVA_HOME=/usr/local/jdk1.8.0_45
    export PATH=$JAVA_HOME/bin:$PATH
    java -jar /programs/picard-tools-2.1.1/picard.jar FastqToSam \
      FASTQ=file1.fastq.gz FASTQ2=file2.fastq.gz \
      O=reads.bam

  Note: FastqToSam also requires a SAMPLE_NAME argument, which is not shown on the slide.

  Run DISCOVAR:

    /programs/discovar/bin/discovar \
      READS=reads.bam \
      OUT_HEAD=assembly \
      REGIONS=all

Evaluation of genome assembly
1. Metrics for contig length
  N50 and L50
    N50: the contig length such that contigs of this length or longer contain 50% of the assembled bases.
    L50: the number of contigs of length N50 or longer, i.e. the smallest number of contigs that together cover 50% of the assembly.
  NG50 and LG50
    N50 is calculated relative to the assembly size; NG50 is calculated relative to the estimated genome size.
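The definitions above translate directly into a few lines of code. The Python sketch below computes N50/L50 from a list of contig lengths, and NG50/LG50 when an estimated genome size is supplied; the function name and the toy contig lengths are made up for the example.

```python
def nx_stats(lengths, total=None, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    total defaults to the assembly size (giving N50/L50); pass the estimated
    genome size instead to get NG50/LG50. Returns (None, None) if the
    assembly never reaches the requested fraction of total.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= total * fraction:
            return length, count
    return None, None


contigs = [80, 70, 50, 40, 30, 20, 10]   # toy contig lengths, 300 bases in total
print(nx_stats(contigs))                 # N50, L50 relative to assembly size
print(nx_stats(contigs, total=400))      # NG50, LG50 for a 400-base genome estimate
```

When the assembly is smaller than the estimated genome, NG50 is lower than N50 (or undefined), which is why NG50 is the fairer metric for comparing assemblies of different total size.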

Standalone tools for generating metrics (most assembly software also reports N50/L50)
QUAST (http://bioinf.spbau.ru/quast)
  Computes evaluation metrics
  Compares an assembly with a reference genome (or assemblies with each other):
    Structural variation
    Genome fraction: % of the reference genome represented in the assembly
    Duplication ratio: copy-number ratio between assembly and reference in aligned regions
    Reference gene representation

2. REAPR: scores each base of the assembly based on the alignment of paired-end reads
  Input: BAM file from aligning the reads to the assembly (paired ends aligned independently)
  Metrics reported by REAPR:
    Scaffold errors
    % of error-free bases

3. Evaluate by gene content: BUSCO
  BUSCO gene sets: genes present as single-copy orthologs in at least 90% of the species in each lineage.
  Lineages: Arthropods, Vertebrates, Fungi, Bacteria, Metazoans, Eukaryotes, Plants

[Figure: BUSCO assessment workflow. Simão et al., Bioinformatics, 2015, 1-3]

BUSCO output
  C: complete; D: duplicated; F: fragmented; M: missing
  (The report gives the % of genes in each category.)
  Simão et al., Bioinformatics, 2015, 1-3
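As an illustration of how the category percentages relate to raw gene counts, the Python sketch below formats a one-line summary. It assumes that duplicated genes are counted as a subset of complete ones, so the total is complete + fragmented + missing; the function name and the counts are hypothetical, and this is not BUSCO's own code.

```python
def busco_summary(complete, duplicated, fragmented, missing):
    """Format category percentages from gene counts.

    Assumes duplicated genes are already included in the complete count.
    """
    total = complete + fragmented + missing

    def pct(n):
        return 100.0 * n / total

    return (f"C:{pct(complete):.1f}%[D:{pct(duplicated):.1f}%],"
            f"F:{pct(fragmented):.1f}%,M:{pct(missing):.1f}%,n:{total}")


# Hypothetical counts for a lineage set of 1000 genes.
print(busco_summary(complete=850, duplicated=50, fragmented=100, missing=50))
```

A high duplicated fraction in a haploid or collapsed diploid assembly usually points to redundant contigs rather than true gene duplication.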

Run BUSCO on BioHPC Lab
  Make a copy of Augustus in the working directory (it must be writable):

    cp -r /programs/augustus-3.2.1 ./

  Set the paths for the required software:

    export AUGUSTUS_CONFIG_PATH=/workdir/XXX/augustus-3.2.1/config
    export PATH=/programs/hmmer/binaries:/programs/emboss/bin:$PATH
    export PATH=/workdir/XXX/augustus-3.2.1/bin:$PATH

  Run BUSCO:

    python3 /programs/busco_v1.2/busco_v1.2.py -o SAMPLE -in assembly.fa -l lineage_db -m genome

4. Evaluate by physical or genetic map
  Use a physical map to anchor scaffolds to chromosomes: BioNano map
  http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome-assembly

Evaluate based on genetic mapping
  Use mapped GBS sequence tags to evaluate each contig (Fei Lu, Buckler lab)
  http://www.nature.com/ncomms/2015/150416/ncomms7914/full/ncomms7914.html