Sequence File Formats


File Formats

Sequence File Formats
- Different formats for different uses
- Competing formats developed in parallel
- Some are easy to read, some are easy to write programs for
- You don't have to stick to these formats, but parsers are already written!
- All formats are plain text

Standard nucleotide codes (IUPAC)
Symbol  Meaning            Origin
G       G                  Guanine
A       A                  Adenine
C       C                  Cytosine
T       T                  Thymine
R       G or A             purine
Y       T or C             pyrimidine
M       A or C             amino
K       G or T             keto
N       G or A or T or C   any

Standard protein codes
One  Three  Amino acid
A    Ala    Alanine
C    Cys    Cysteine
D    Asp    Aspartic acid
E    Glu    Glutamate
F    Phe    Phenylalanine
G    Gly    Glycine
H    His    Histidine
I    Ile    Isoleucine
K    Lys    Lysine
L    Leu    Leucine
M    Met    Methionine
N    Asn    Asparagine
P    Pro    Proline
Q    Gln    Glutamine
R    Arg    Arginine
S    Ser    Serine
T    Thr    Threonine
V    Val    Valine
W    Trp    Tryptophan
Y    Tyr    Tyrosine
X    Xaa    Unknown

GenBank
- More complex; includes detailed information on genes, CDS, annotation, etc.
- Human readable
- Difficult to parse
- Use standard parsers (BioPerl, BioJava, etc.)

LOCUS       NC_001418               5833 bp ss-DNA     circular PHG 17-APR-2009
DEFINITION  Pseudomonas phage Pf3, complete genome.
ACCESSION   NC_001418
VERSION     NC_001418.1  GI:9626316
DBLINK      Project:14061
KEYWORDS    .
SOURCE      Pseudomonas phage Pf3
  ORGANISM  Pseudomonas phage Pf3
            Viruses; ssDNA viruses; Inoviridae; Inovirus.
FEATURES             Location/Qualifiers
     source          1..5833
                     /organism="Pseudomonas phage Pf3"
                     /mol_type="genomic DNA"
                     /host="Pseudomonas aeruginosa"
                     /db_xref="taxon:10872"
                     /note="Pf3 bacteriophage DNA from P.aeruginosa infected
                     with plasmid RP1."
     gene            join(5763..5833,1..106)
                     /locus_tag="pf3_1"
                     /db_xref="GeneID:1260905"
     CDS             join(5763..5833,1..106)
                     /locus_tag="pf3_1"
                     /note="ORF 58, part 2"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_040651.1"
                     /db_xref="GI:9626317"
                     /db_xref="GeneID:1260905"
                     /translation="MSYYVCVQLVNDVCHEWAERSDLLSLPEGSGLQIGGMLLLLSAT
                     AWGIQQIARLLLNR"

     3241 aggtcctgtt ggccttaaga tcacccaagg gcatcttgcc agatggtacc gtcattactt
     3301 atgagaaaat atcctcaatg ggtaatggct ataccttcga gcttgagtcg cttatatttg
     3361 cggctcttgc tcggtcttta tgcgaattac tgggcttacg accgtcagat gttacggtct
     3421 atggcgatga cataatattg ccatcagacg cgtgcagtcc tctagttgaa gttttctcct
     3481 atgttggttt tcgtaccaac aagaagaaaa cgttttctag tggaccgttc cgagagtcgt
     3541 gcggaaagca ctactttttg ggcgttgacg tcacaccttt ctacatacgt cgccgtatag
     3601 tgagtccctc cgatctcata ctggttttga accagatgta tcgttgggcc acaattgacg
     3661 gcgtatggga tcctagggta tatcctgtat acaccaagta tagacgttac cttccggaaa
     3721 ttctccggag gaatgtcgtg cctgatggat acggtgatgg tgccctcgtc ggatctgtct
     3781 taatcagtcc tttcgcagaa aatcgcggtt gggttcggcg tgtgccgatg attatagaca
     3841 agaggaaaga ccgagttcgt gacgaatatg gttcgtatct ctacgagcta tggtcgttgc
     3901 agcaactcga atgtgacagt gagttcccct ttaacgggtc gctggtcgtt ggttccactg
     3961 atggcactct cgcttacgca caccgagaac ggttacctac cgttatcagt gatgccgtaa
     4021 gtgcgtttga catcatgtgg ataccgtgca gtagtcgtgt cctggctccc tacggggatt
     4081 tccggaggca cgaaggctct atcctaaaaa tggggtagcg cctgggaggg gtgcattatg
     4141 caccctaggt tagcaatact taaactaacc ttctcaaaag agagagtgaa ggctctgctt
     4201 tgccctcact cctccca
//
LOCUS       NC_003301               3192 bp ds-RNA     linear   PHG 23-AUG-2008
DEFINITION  Pseudomonas phage phi8 segment S, complete sequence.
ACCESSION   NC_003301
VERSION     NC_003301.1  GI:17736965
DBLINK      Project:14731
KEYWORDS    .
SOURCE      Pseudomonas phage phi8
  ORGANISM  Pseudomonas phage phi8
            Viruses; dsRNA viruses; Cystoviridae; Cystovirus.

GFF3
- Tab-separated format
- Easy to parse
- Attributes are tag/value pairs separated by ;
- Columns:
  1. Contig
  2. Source database
  3. Feature type
  4. Start
  5. Stop
  6. Score
  7. Strand
  8. Phase
  9. Attributes
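The nine columns listed above can be split with a few lines of Python. This is a minimal sketch, not a full GFF3 parser: it assumes attributes are written as `tag=value`, skips no comment lines, and does no URL-unescaping. The `GffFeature` type, the `parse_gff3_line` name and the example line are all made up for illustration.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    """One GFF3 row, with fields named after the nine columns on the slide."""
    contig: str
    source: str
    feature_type: str
    start: int
    stop: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one tab-separated GFF3 line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(columns)}")
    attributes = {}
    for pair in columns[8].split(";"):
        if pair:
            # Assumption: each attribute is written as tag=value.
            tag, _, value = pair.partition("=")
            attributes[tag] = value
    return GffFeature(columns[0], columns[1], columns[2],
                      int(columns[3]), int(columns[4]),
                      columns[5], columns[6], columns[7], attributes)


# Hypothetical example line, not taken from a real annotation file.
example = "contig1\texample_db\tCDS\t1\t106\t.\t+\t0\tID=cds1;product=hypothetical protein"
feature = parse_gff3_line(example)
print(feature.feature_type, feature.start, feature.stop, feature.attributes["product"])
```

For real files, one of the standard parsers is still the safer choice.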

ASN.1
- Developed as a computer-readable form of GenBank
- Not widely used

ASN.1
seq {
  id { local id 1 },
  descr { title "" },
  inst { repr raw, mol aa, length 131, topology linear, {
    seq-data iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQA
TGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFA
PTLMSSCITSTTGPPAWAGDRSHE" } },
seq {
  id { local id 1 },
  descr { title "" },
  inst { repr raw, mol aa, length 131, topology linear, {
    seq-data iupacaa "TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTL
MSSCITSTTGPPAWAGDRSHE" } }

Fasta
- Also called Pearson format
- Simplest file format
- Easy to parse, easy to use

>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactacgctagcatcagtcat
CAT
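To show why fasta counts as easy to parse, here is a minimal reader sketch in Python. It assumes the layout shown above: a `>` header with the identifier up to the first space, followed by sequence lines that may wrap. The function name `read_fasta` is made up for illustration.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) for each fasta record."""
    identifier, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # Everything up to the first space is the identifier.
            identifier, _, description = line[1:].partition(" ")
            chunks = []
        elif identifier is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks.append(line)  # wrapped sequence lines are joined
    if identifier is not None:
        yield identifier, description, "".join(chunks)


example = [
    ">identifier [optional information]",
    "ATGACTAGCATGCATCGATCGATCGACTAGCATG",
    "ACTGCACTACGACGACAGCAAC",
]
for name, info, sequence in read_fasta(example):
    print(name, info, len(sequence))
```

The same function works on an open file handle, because a file iterates line by line.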

Qual
>identifier [optional information]
35 35 35 35 22 35 35 35 31 35 31 35 34 35 35 31 35 36 35 37 36 36 36 22 35 34 31 35 35 35 35 25
>identifier2 [optional information]
35 36 37 31 35 35 28 28 28 35 34 34 33 32 34 34 33 29 34 36 15 28 27 27 27 27 21 35 35 35 33 33

fastq
- Based on fasta format
- Contains information about the quality of the sequence
- Quality comes from the sequencing machines!
- Four lines per sequence:
  1. Line starting with @ = identifier line before the sequence
  2. DNA sequence
  3. Line starting with + = identifier line before the quality scores
  4. String = quality scores, each stored as the ASCII character for (score + 33)

fastq
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA
AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA
CCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:E
A1EA1EA5 9B:?:#9EA0D@2EA5':>5?:%A;A8
A;?9B;D@/=<?7=9<2A8==
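The four-lines-per-sequence layout can be read with a short Python sketch. It assumes each record is exactly four unwrapped lines with no blank lines in between, so a record wrapped over several lines (as on the slide) would need joining first. The name `read_fastq` is made up for illustration.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line fastq records."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated fastq record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


example = ["@read1 length=4", "ACGT", "+read1 length=4", "II5#"]
for name, sequence, quality in read_fastq(example):
    print(name, sequence, quality)
```

The length check is worth keeping: a lost quality character shows up immediately as a mismatch.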

Base calling
- Need to be sure which base you have identified
- Depends on the technology
- Each machine includes software
- Phred is a historical package developed at U. Washington
- Phred scores encode the probability that the base call is wrong

Quality values
- Phred 10: 1 x 10^-1 chance that the base is wrong
- Phred 20: 1 x 10^-2 chance that the base is wrong
- Phred 30: 1 x 10^-3 chance that the base is wrong
- Phred 40: 1 x 10^-4 chance that the base is wrong
- Phred 99: the base is correct!
- Fastq scores are the score + 33, then converted to ASCII text
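The relationship between a Phred score, its error probability and its fastq character can be checked with a small Python sketch. The function names are made up for illustration, and the offset of 33 is the one stated on the slide.

```python
def phred_to_error_probability(score):
    """Phred Q corresponds to a 10^(-Q/10) chance that the base is wrong."""
    return 10 ** (-score / 10)


def encode_quality(scores, offset=33):
    """Turn numeric scores into a fastq quality string (score + offset -> ASCII)."""
    return "".join(chr(score + offset) for score in scores)


def decode_quality(text, offset=33):
    """Turn a fastq quality string back into numeric Phred scores."""
    return [ord(character) - offset for character in text]


print(phred_to_error_probability(30))   # about 0.001
print(encode_quality([35, 22]))
print(decode_quality("3+&$#"))
```

Decoding the first five characters of the example read, `3+&$#`, gives scores of 18, 10, 5, 3 and 2.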

ASCII character codes
ASCII Char   ASCII Char   ASCII Char   ASCII Char   ASCII Char
33    !      50    2      70    F      90    Z      110   n
34    "      51    3      71    G      91    [      111   o
35    #      52    4      72    H      92    \      112   p
36    $      53    5      73    I      93    ]      113   q
37    %      54    6      74    J      94    ^      114   r
38    &      55    7      75    K      95    _      115   s
39    '      56    8      76    L      96    `      116   t
40    (      57    9      77    M      97    a      117   u
41    )      58    :      78    N      98    b      118   v
42    *      59    ;      79    O      99    c      119   w
43    +      60    <      80    P      100   d      120   x
44    ,      61    =      81    Q      101   e      121   y
45    -      62    >      82    R      102   f      122   z
46    .      63    ?      83    S      103   g      123   {
47    /      64    @      84    T      104   h      124   |
48    0      65    A      85    U      105   i      125   }
49    1      66    B      86    V      106   j      126   ~

fastq
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA    <- DNA sequence
AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA
CCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:E    <- Quality scores
A1EA1EA5 9B:?:#9EA0D@2EA5':>5?:%A;A8
A;?9B;D@/=<?7=9<2A8==
Note: Illumina used to have a fastq format that was not compatible with everyone else's format!

Ion quality scores

PRINSEQ
Rob Schmieder

prinseq

Basic data analysis
New dataset → Assemble data → Perform similarity search

Bad data analysis


Good data analysis
New dataset → Quality control & Preprocessing → Assembly → Similarity search

3 Tools for metagenomic data
- http://prinseq.sourceforge.net
- http://tagcleaner.sourceforge.net
- http://deconseq.sourceforge.net

Quality control and data preprocessing

Number and Length of Sequences

Number/Length of sequences
Panels: Bad | Good
- Reads should be approx. the same length (same number of cycles)
  → Short reads are likely lower quality

Quality of Sequences

Linearly degrading quality across the read → trim low-quality ends

Quality filtering
- Any region with a homopolymer will tend to have a lower quality score
- Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)

Low quality sequence issue
- Most assemblers or aligners do not take quality scores into account
- Errors in reads complicate assembly, might cause misassembly, or make assembly impossible

What if quality scores are not available?
- Alternative: infer quality from the percent of Ns found in the sequence
- Remove regions with a high number of Ns
- Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)

Ambiguous bases
- If you can afford the loss, filter out all reads containing Ns
- Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, ...) use a 2-bit encoding system for nucleotides
- Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
- 2-bit example: 00 = A, 01 = C, 10 = G, 11 = T
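A short Python sketch shows why a 2-bit encoding has no room for N and what a fixed replacement does to a read. It uses the 00/01/10/11 table from the slide. The function names are made up for illustration, and this is not the actual code of any assembler or aligner.

```python
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(sequence, replacement="A"):
    """Pack a DNA string into one integer, two bits per base.

    Any symbol other than A, C, G or T (such as N) is stored as the
    replacement base, because four codes leave no slot for it.
    """
    value = 0
    for base in sequence.upper():
        if base not in TWO_BIT:
            base = replacement
        value = (value << 2) | TWO_BIT[base]
    return value


def unpack_2bit(value, length):
    """Recover the stored bases; replaced Ns come back as ordinary bases."""
    bases = "ACGT"
    return "".join(bases[(value >> (2 * (length - 1 - i))) & 0b11]
                   for i in range(length))


def contains_ambiguous(sequence):
    """True if the read has any symbol outside A, C, G, T."""
    return any(base not in TWO_BIT for base in sequence.upper())


print(bin(pack_2bit("ACGT")))             # 0b11011, i.e. 00 01 10 11
print(unpack_2bit(pack_2bit("ANGT"), 4))  # the N is silently read back as A
print(contains_ambiguous("ANGT"))
```

The round trip makes the point of the slide: once packed, an N is indistinguishable from a real A, so filtering has to happen before assembly or alignment.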

Tag Sequences

o tag ID tag TA tags

Detect and remove tag sequences

Data upload Tag sequence definition

Tag sequence prediction

Parameter definition Download results

Sequence Contamination

Principal component analysis (PCA) of dinucleotide relative abundance Microbial metagenomes Viral metagenomes

Identification and removal of sequence contamination

Contaminant identification
- Current methods have critical limitations
- Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences
- Sequence similarity seems to be the only reliable option to identify single contaminant sequences
- BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, ...)
- Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

DeconSeq web interface
Two types of reference databases:
- Remove
- Retain

DeconSeq web interface (cont.)

Human DNA contamination identified in 145 out of 202 metagenomes

Contamination Identification

http://edwards.sdsu.edu/genomepeek

16S

recA

rpoB

groEL

Using prinseq to check the quality of the data

Using prinseq
1. In the folder, go to /home/qiime/documents/ecoli
2. Right click on CIA49E_coli_DH10B_Control_200.zip
   Choose Extract Here
3. Right click on R_2014_02_04_00_54_34_user_CIA-49-Ion_PGM_E_coli_DH10B_Control_200
   Choose open terminal here
4. Run this command:
   prinseq-lite.pl -fastq Ecoli.fastq -graph_stats ld,gc,qd,ns,ts -graph_data ~/Desktop/DH10B.gd
5. Open firefox: go to http://edwards.sdsu.edu/prinseq
   Choose Get Report

After checking the quality:
1. Trim the sequences to remove low quality (<15) at the right end
2. Filter the sequences to remove sequences shorter than 200 bp
3. Create a new fastq file
For the available options, run:
prinseq-lite.pl -h

Trimming
prinseq-lite.pl -fastq ecoli.fastq -min_len 200 -trim_qual_right 15 -out_good DH10B.trimmed
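To make the two options concrete, here is a minimal Python sketch of right-end quality trimming followed by a length filter. It is not PRINSEQ's own code. It assumes Phred+33 quality strings, that bases are removed from the right end until one reaches the threshold, and that the length filter runs after trimming. The function names are made up for illustration.

```python
def trim_right(sequence, quality, min_quality=15, offset=33):
    """Drop bases from the right end while their score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence, min_length=200):
    """Length filter: keep reads with at least min_length bases."""
    return len(sequence) >= min_length


# 'I' encodes a score of 40; '#' and '&' encode 2 and 5.
trimmed_sequence, trimmed_quality = trim_right("ACGTAC", "IIII#&")
print(trimmed_sequence, trimmed_quality)
print(keep_read(trimmed_sequence))   # False: far shorter than 200 bp
```

Only the right end is examined, so a low score in the middle of a read is left alone.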

Convert fastq to fasta
On the virtual box:
- fastq2qual_fasta.pl
- fastq_to_fasta
- prinseq-lite.pl
On the web:
- http://edwards.sdsu.edu/cgi-bin/fastq2fasta.cgi

Using prinseq to convert fastq to fasta
prinseq-lite.pl -h
prinseq-lite.pl -fastq Ecoli_trimmed.fastq -out_format 2 -out_good Ecoli_trimmed

Sequence Assembly

Assembling the data
- Problem: the longest single sequence possible is 1,000 bp, and most technologies produce 50-500 bp
- Microbial genomes are 2,000,000 bp
- Therefore, how do you sequence a whole genome?

Sequencing the genomes
- Extract DNA
- Shear DNA into small pieces
- Ligate adapters on each end
- Sequence using next generation sequencing

Sequence assembly
- Before we look at the data
- Can we make longer pieces?

The assembly
- A hierarchical data structure that maps sequence data to a reconstruction of the target
- The assembly groups
  - reads into contigs
  - contigs into scaffolds
- Contigs provide
  - multiple sequence alignment of reads
  - consensus sequence
- Scaffolds provide
  - contig order and orientation
  - sizes of the gaps between contigs

Sequence assembly
Reads → Contigs → Scaffolds

Four approaches to assembly
- Naïve approach
- Greedy approach
- Overlap / Layout / Consensus
- de Bruijn graphs

Naïve approach
- Compare every sequence to every other sequence
- Find stretches that are the same
- Need to account for phred scores: what if a base is wrong?
- How long does a sequence need to be to be unique?

Sequence composition
- 4 bases
- 4^n possible sequences of length n, so a 1 in 4^n chance of finding a given sequence if all bases are evenly used (they are not)
- 3 bp: 4^3 = 64
- 8 bp: 4^8 = 65,536
- 20 bp: 4^20 = 1,099,511,627,776
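The arithmetic can be reproduced in a few lines of Python. The second function is an added illustration, not from the slides: it estimates how often a given sequence would turn up by chance in a genome of a given size, under the same unrealistic assumption that all four bases are used evenly. The function names are made up.

```python
def possible_sequences(length, alphabet_size=4):
    """Number of distinct sequences of a given length over 4 bases."""
    return alphabet_size ** length


def expected_chance_hits(sequence_length, genome_length):
    """Expected chance occurrences of one given sequence on one strand,
    assuming evenly used, independent bases (real genomes are not like this)."""
    positions = genome_length - sequence_length + 1
    return positions / possible_sequences(sequence_length)


for length in (3, 8, 20):
    print(length, possible_sequences(length))

# An 8 bp sequence is expected by chance many times in a 2,000,000 bp genome;
# a 20 bp sequence almost never.
print(expected_chance_hits(8, 2_000_000))
print(expected_chance_hits(20, 2_000_000))
```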

Problems with this approach
- Sequences are not random
- Most genomes contain biased information
- Repeat sequences in the genome

Greedy approaches
- Start with a sequence
- Keep extending it while another sequence matches the end
- When it cannot be extended further, mark it as a contig
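The greedy idea can be sketched in Python as below. This is a toy version under several stated assumptions: exact matches only, extension to the right only, no quality scores and no reverse complements. It is not the algorithm of any particular assembler, and the function names and `min_overlap` parameter are made up for illustration.

```python
def overlap(left, right, min_overlap):
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_extend(seed, reads, min_overlap=3):
    """Extend seed to the right with the best-overlapping unused read,
    repeating until no read matches the end. Returns (contig, unused reads)."""
    contig = seed
    unused = list(reads)
    while unused:
        best = max(unused, key=lambda read: overlap(contig, read, min_overlap))
        size = overlap(contig, best, min_overlap)
        if size == 0:
            break  # cannot be extended further: mark as a contig
        contig += best[size:]
        unused.remove(best)
    return contig, unused


contig, leftover = greedy_extend("ATGCG", ["GTACC", "GCGTA", "TTTTT"])
print(contig, leftover)
```

Because each step takes the locally best match and never reconsiders it, a repeat can send the extension down the wrong path.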

Improve greedy approaches
- Only use high-quality sequence
- Use reads that are represented more than n times in the sample (SSAKE)
- End-to-end overlap vs. partial overlap
- Ignore low-coverage regions and also incorporate quality scores (SHARCGS)
- In general, greedy approaches are fast but not very good: they make lots of short contigs

Overlap / Layout / Consensus
- All-versus-all comparison (done with k-mers for speed)
- Generate an approximate read layout as an overlap graph
- Use multiple sequence alignments to resolve the layout

Newbler (O/L/C)
- Makes unitigs: single contigs with no discrepancies
- Merges unitigs into contigs
- May split unitigs and even reads (could be chimeras)
- Uses coverage to compensate for base calls
- Works in flow space to calculate homopolymeric tracts
- More accurate than average of averages

Assembly is a graph problem
- Overlap/Layout/Consensus
- de Bruijn graph
- Greedy graphs
A graph is nodes + edges
Figure labels: node, edge

Assemble these two sequences!
AACCGGT
CCGGTTA
Consensus: AACCGGTTA

AACCGGT as graphs
Nodes = k-mers; edges join nodes that overlap by K-1 bases.
aacc → accg → ccgg → cggt
Here K = 4, but in reality K = 19 to 31

CCGGTTA as graphs
ccgg → cggt → ggtt → gtta

Join the two graphs
                ccgg → cggt → ggtt → gtta
aacc → accg → ccgg → cggt
AACCGGTTA
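The joining step above can be reproduced with a small Python sketch that follows the slides' definition: nodes are k-mers and an edge joins two nodes that overlap by K-1 bases. It only spells out a sequence when the graph is a single unbranched path, which is exactly the case that repeats break. The function names are made up, and real assemblers use far more elaborate graph handling.

```python
def kmers(sequence, k):
    """All k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(reads, k):
    """Nodes are the k-mers of all reads; an edge a -> b exists when the
    last k-1 bases of a equal the first k-1 bases of b."""
    nodes = {kmer for read in reads for kmer in kmers(read.lower(), k)}
    return {a: {b for b in nodes if a[1:] == b[:-1]} for a in nodes}


def spell_single_path(graph):
    """Walk from the only node without an incoming edge and spell the sequence."""
    targets = {b for successors in graph.values() for b in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("graph is not a single unbranched path")
    node = starts[0]
    seen = {node}
    sequence = node
    while graph[node]:
        if len(graph[node]) != 1:
            raise ValueError("branch found, for example at a repeat")
        (node,) = graph[node]
        if node in seen:
            raise ValueError("cycle found")
        seen.add(node)
        sequence += node[-1]
    return sequence


graph = build_graph(["AACCGGT", "CCGGTTA"], k=4)
print(sorted(graph))
print(spell_single_path(graph).upper())
```

The shared k-mers ccgg and cggt appear once in the node set, which is what merges the two reads into one path.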

Differences between an overlap graph and a de Bruijn graph for assembly. Schatz M C et al. Genome Res. 2010;20:1165-1173 2010 by Cold Spring Harbor Laboratory Press

Problems with all assemblies
- Sequences are not random
- Most genomes contain biased information
- Repeat sequences in the genome

Repeats


Repeats have multiple sinks/sources
- 16S
- Salmonella has 7 rrn operons
- Salmonella recombines at rrn operons (Helm and Maloy)

Repeat sequences
- What happens if the repeat is longer than the read length?
- Need paired end reads to resolve order
- Need pairs that span the repeat
- Need pairs with one end in the repeat

Mate pair and paired end sequencing

Mate pair Sequencing (Ion Torrent) Add linkers

Mate pair sequencing (Ion Torrent) Nick Sequencing migration

Paired End Sequencing (Illumina) Left read DNA fragment Right read

Paired End Sequencing (Illumina) Number of fragments Fragment length

Paired End Sequencing (Illumina)

Joining paired end sequences
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]
Citation - PEAR: a fast and accurate Illumina Paired-End read merger
Zhang et al (2014) Bioinformatics 30(5): 614-620
doi:10.1093/bioinformatics/btt593
License: Creative Commons Licence
Bug-reports and requests to: Tomas.Flouri@h-its.org and Jiajie.Zhang@h-its.org

Usage: pear <options>
Standard (mandatory):
  -f, --forward-fastq <str>   Forward paired-end FASTQ file.
  -r, --reverse-fastq <str>   Reverse paired-end FASTQ file.
  -o, --output <str>          Output filename.

http://sco.h-its.org/exelixis/web/software/pear/

Repeats A B C Paired end reads or mate pairs

Sequence assembly Reads Contigs Scaffolds

Current assemblers
- AMOS
- Celera WGA Assembler
- CLC Genomics Workbench
- DNA Dragon
- DNAnexus
- Euler
- Geneious
- IDBA (Iterative De Bruijn graph short read Assembler)
- LIGR Assembler (derived from TIGR Assembler)
- MIRA (Mimicking Intelligent Read Assembly)
- Newbler
- Phrap
- SSAKE
- SOAPdenovo
- SPAdes
- Velvet

Assembly
1. De novo assembly with Newbler (runAssembly)
2. Map to the genome with Newbler (runMapping)
3. De novo assembly with SPAdes (spades.py)

De novo assembly with Newbler
1. Convert fastq to fasta
2. Use this command: runAssembly <fasta file>
   e.g. (all on one line)
   runAssembly -noinfo -nobig -noace Ecoli.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna

basic_stats: assembly
------------- BASIC FASTA STATISTICS ---------------
Total number of sequences: 1882
Total number of bases: 4,266,422 bp (4.27 Mb)
Average sequence length: 2266.96
Minimum sequence length: 100 bp
Maximum sequence length: 23892 bp
N50: 3491 bp
50% of total sequence length is contained in 369 sequences
GC %: 50.79 %
-----------------------------------------------------

basic_stats: mapping
------------- BASIC FASTA STATISTICS ---------------
Total number of sequences: 654
Total number of bases: 4,339,931 bp (4.34 Mb)
Average sequence length: 6635.98
Minimum sequence length: 160 bp
Maximum sequence length: 49317 bp
N50: 12274 bp
50% of total sequence length is contained in 109 sequences
GC %: 50.76 %

N50

N50 Length = N50
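N50 can be computed from a list of contig lengths with a short Python sketch. It uses the common definition that matches the basic_stats line "50% of total sequence length is contained in ... sequences": sort the contigs from longest to shortest and report the length at which the running total reaches half of all bases. The function name `n50` is made up for illustration.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach 50% of bases)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequences given")
    half = sum(ordered) / 2
    running_total = 0
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= half:
            return length, count


# Toy contig lengths, not taken from the assemblies above.
print(n50([10, 20, 30, 40]))
```

A larger N50 with fewer sequences, as in the mapping result compared with the de novo result, means the same amount of sequence sits in longer contigs.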

Mapping to a genome
1. Convert the E. coli GenBank file to fasta format
   GB2Fasta.pl CP000948_DH10B.gbk CP000948_DH10B.fasta
2. Map the reads to the genome
   runMapping -cpu 2 -noinfo -noace -nobig -gref CP000948_DH10B.fasta DH10B.trimmed.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna

SPAdes 3.7.1
1. Run this command (all on one line):
   /opt/spades-3.7.1-linux/bin/spades.py --iontorrent -k 21,33,55,77,99,127 --mismatch-correction -t 2 -s ecoli.fastq -o ecoli.spades

Hybrid assembly Geni Silva

scaffold_builder
http://edwards.sdsu.edu/scaffold_builder
Silva et al. Source Code for Biology and Medicine 2013, 8:23

How do we know if the assembly is good?

QUAST
1. Run this command (all on one line):
   quast -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
   With the full path to the script:
   python /opt/quast_3.2/quast.py -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
2. When the command finishes:
   gnome-open Ecoli.quast/report.html
   (also open Ecoli.quast/alignment.svg and Ecoli.quast/report.pdf)

Mauve
- Locally co-linear blocks (LCBs)
- Homologous regions shared by any two genomes
- User-specified cutoff for calling regions homologous

Multi-mums
- Multiple unique matches
- Identify k-mers that:
  - Are shared by 2 or more genomes
  - Only appear once per genome
  - Are bounded by a mismatched base (cannot be extended further)

Mauve
1. Find all local multimums
2. Calculate phylogenetic guide tree
3. Select a subset of multimums to create LCBs
4. Anchor regions not covered with multimums
5. Align LCBs

How to construct...
- Construct sorted list of k-mers
- Find the set that matches the occurrence criteria:
  - Once per genome
  - >1 genome
- Extend until mismatch
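The occurrence criteria can be illustrated with a Python sketch that finds seed k-mers occurring at most once in every genome and exactly once in at least two of them. This is a simplification made for illustration: it uses a dictionary instead of a sorted list, skips the extend-until-mismatch step, and ignores reverse complements. It is not Mauve's implementation, and the function names are made up.

```python
from collections import defaultdict


def kmer_positions(genome, k):
    """Map each k-mer to the list of start positions where it occurs."""
    positions = defaultdict(list)
    for start in range(len(genome) - k + 1):
        positions[genome[start:start + k]].append(start)
    return positions


def multi_mum_seeds(genomes, k):
    """Return {k-mer: [start in genome 0, start in genome 1, ...]} for k-mers
    that never occur twice in any genome and occur in more than one genome.
    A start of None means the k-mer is absent from that genome."""
    tables = [kmer_positions(genome, k) for genome in genomes]
    seeds = {}
    for kmer in set().union(*tables):
        hits = [table.get(kmer, []) for table in tables]
        if any(len(hit) > 1 for hit in hits):
            continue  # repeated within a genome: not unique
        if sum(len(hit) == 1 for hit in hits) >= 2:
            seeds[kmer] = [hit[0] if hit else None for hit in hits]
    return seeds


print(multi_mum_seeds(["ACGTTGCA", "TTACGTAA"], k=4))
```

Each seed records its start in every genome, which is the information needed later to join matches in a consistent order.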

Constructing sorted list
Hash using hash function:
S_1 = start of multimum M in first genome
S_j = start of multimum M in genome j

Join all mums
Join all mums together where: M_i.S_j <= M_{i+1}.S_j
For any genome j, the start of multimum i is less than or equal to the start of the next multimum.

Start with all matches

Partition into locally co-linear blocks

Remove low-scoring LCBs (3 * length of k-mer)

Reduce k-mer size, and repeat for unaligned regions

Mauve Speed
- Aligned many prokaryote genomes
- Human, mouse, and rat genomes

Problems - Repeats
For a repetitive element appearing r times in G genomes:
- r^G possible combinations
- r of them are correct!

Mauve
From the Apps menu on the top left, choose Mauve
File → Align with progressiveMauve