Sequence File Formats

File Formats

Sequence File Formats
- Different formats for different uses
- Competing formats developed in parallel
- Some are easy to read, some are easy to write programs for
- You don't have to stick to these formats, but parsers are already written!
- All formats are plain text

Standard nucleotide codes (IUPAC)
Symbol  Meaning           Origin
G       G                 Guanine
A       A                 Adenine
C       C                 Cytosine
T       T                 Thymine
R       G or A            puRine
Y       T or C            pYrimidine
M       A or C            aMino
K       G or T            Keto
N       G or A or T or C  aNy

Standard protein codes
One  Three  Amino acid       One  Three  Amino acid
A    Ala    Alanine          M    Met    Methionine
C    Cys    Cysteine         N    Asn    Asparagine
D    Asp    Aspartic acid    P    Pro    Proline
E    Glu    Glutamate        R    Arg    Arginine
F    Phe    Phenylalanine    S    Ser    Serine
G    Gly    Glycine          T    Thr    Threonine
H    His    Histidine        V    Val    Valine
I    Ile    Isoleucine       W    Trp    Tryptophan
K    Lys    Lysine           Y    Tyr    Tyrosine
L    Leu    Leucine          X    Xaa    Unknown

GenBank
- More complex; includes detailed information on genes, CDS, annotation, etc.
- Human readable
- Difficult to parse
- Use standard parsers (BioPerl, BioJava, etc.)

LOCUS       NC_001418               5833 bp ss-DNA     circular PHG 17-APR-2009
DEFINITION  Pseudomonas phage Pf3, complete genome.
ACCESSION   NC_001418
VERSION     NC_001418.1  GI:9626316
DBLINK      Project:14061
KEYWORDS    .
SOURCE      Pseudomonas phage Pf3
  ORGANISM  Pseudomonas phage Pf3
            Viruses; ssDNA viruses; Inoviridae; Inovirus.
FEATURES             Location/Qualifiers
     source          1..5833
                     /organism="Pseudomonas phage Pf3"
                     /mol_type="genomic DNA"
                     /host="Pseudomonas aeruginosa"
                     /db_xref="taxon:10872"
                     /note="Pf3 bacteriophage DNA from P.aeruginosa infected
                     with plasmid RP1."
     gene            join(5763..5833,1..106)
                     /locus_tag="pf3_1"
                     /db_xref="GeneID:1260905"
     CDS             join(5763..5833,1..106)
                     /locus_tag="pf3_1"
                     /note="orf 58, part 2"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_040651.1"
                     /db_xref="GI:9626317"
                     /db_xref="GeneID:1260905"
                     /translation="MSYYVCVQLVNDVCHEWAERSDLLSLPEGSGLQIGGMLLLLSAT
                     AWGIQQIARLLLNR"

     3241 aggtcctgtt ggccttaaga tcacccaagg gcatcttgcc agatggtacc gtcattactt
     3301 atgagaaaat atcctcaatg ggtaatggct ataccttcga gcttgagtcg cttatatttg
     3361 cggctcttgc tcggtcttta tgcgaattac tgggcttacg accgtcagat gttacggtct
     3421 atggcgatga cataatattg ccatcagacg cgtgcagtcc tctagttgaa gttttctcct
     3481 atgttggttt tcgtaccaac aagaagaaaa cgttttctag tggaccgttc cgagagtcgt
     3541 gcggaaagca ctactttttg ggcgttgacg tcacaccttt ctacatacgt cgccgtatag
     3601 tgagtccctc cgatctcata ctggttttga accagatgta tcgttgggcc acaattgacg
     3661 gcgtatggga tcctagggta tatcctgtat acaccaagta tagacgttac cttccggaaa
     3721 ttctccggag gaatgtcgtg cctgatggat acggtgatgg tgccctcgtc ggatctgtct
     3781 taatcagtcc tttcgcagaa aatcgcggtt gggttcggcg tgtgccgatg attatagaca
     3841 agaggaaaga ccgagttcgt gacgaatatg gttcgtatct ctacgagcta tggtcgttgc
     3901 agcaactcga atgtgacagt gagttcccct ttaacgggtc gctggtcgtt ggttccactg
     3961 atggcactct cgcttacgca caccgagaac ggttacctac cgttatcagt gatgccgtaa
     4021 gtgcgtttga catcatgtgg ataccgtgca gtagtcgtgt cctggctccc tacggggatt
     4081 tccggaggca cgaaggctct atcctaaaaa tggggtagcg cctgggaggg gtgcattatg
     4141 caccctaggt tagcaatact taaactaacc ttctcaaaag agagagtgaa ggctctgctt
     4201 tgccctcact cctccca
//
LOCUS       NC_003301               3192 bp ds-RNA     linear   PHG 23-AUG-2008
DEFINITION  Pseudomonas phage phi8 segment S, complete sequence.
ACCESSION   NC_003301
VERSION     NC_003301.1  GI:17736965
DBLINK      Project:14731
KEYWORDS    .
SOURCE      Pseudomonas phage phi8
  ORGANISM  Pseudomonas phage phi8
            Viruses; dsRNA viruses; Cystoviridae; Cystovirus.

GFF3
- Tab-separated format
- Easy to parse
- Attributes are tag/value pairs separated by ;
- Columns:
  1. Contig
  2. Source database
  3. Feature type
  4. Start
  5. Stop
  6. Score
  7. Strand
  8. Phase
  9. Attributes
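As a rough illustration of why this format is easy to parse, the nine columns can be split on tabs and the attribute column on semicolons. This is a minimal Python sketch, not a full GFF3 parser: the function name `parse_gff3_line` and the example feature line are made up for illustration (the coordinates are borrowed from the Pf3 GenBank record), and it assumes attributes are written as tag=value.

```python
from urllib.parse import unquote

# Column names follow the nine-column list on the slide (names are our own choice).
GFF3_COLUMNS = ("contig", "source", "type", "start", "stop",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["stop"] = int(record["stop"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                tag, _, value = pair.partition("=")
                attributes[tag] = unquote(value)  # values may be percent-encoded
    record["attributes"] = attributes
    return record


# Hypothetical feature line, for illustration only.
example_line = "NC_001418\tRefSeq\tCDS\t5763\t5833\t.\t+\t0\tID=cds0;locus_tag=pf3_1"
feature = parse_gff3_line(example_line)
print(feature["type"], feature["start"], feature["stop"], feature["attributes"])
```

Real files also contain directives (lines starting with ##) and multi-valued attributes, which this sketch does not interpret.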

ASN.1
- Developed as a computer-readable form of GenBank
- Not widely used

ASN.1
seq {
  id { local id 1 },
  descr { title "" },
  inst { repr raw, mol aa, length 131, topology linear, {
    seq-data iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQA
TGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFA
PTLMSSCITSTTGPPAWAGDRSHE" } },
seq {
  id { local id 1 },
  descr { title "" },
  inst { repr raw, mol aa, length 131, topology linear, {
    seq-data iupacaa "TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTL
MSSCITSTTGPPAWAGDRSHE" } }

Fasta
- Also called Pearson format
- Simplest file format
- Easy to parse, easy to use
>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactacgctagcatcagtcat
CAT
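To show how little code the format needs, here is a minimal Python sketch of a fasta reader applied to the example records from this slide. The function names are our own; for real work the standard parsers mentioned earlier are the safer choice.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) for each record in an iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # Everything up to the first space is the identifier; the rest is optional information.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


fasta_text = """>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactacgctagcatcagtcat
CAT
"""

fasta_records = list(read_fasta(fasta_text.splitlines()))
for identifier, description, sequence in fasta_records:
    print(identifier, len(sequence))
```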

Qual
>identifier [optional information]
35 35 35 35 22 35 35 35 31 35 31 35 34 35 35 31 35 36 35 37 36 36 36 22 35 34 31 35 35 35 35 25
>identifier2 [optional information]
35 36 37 31 35 35 28 28 28 35 34 34 33 32 34 34 33 29 34 36 15 28 27 27 27 27 21 35 35 35 33 33

fastq
- Based on fasta format
- Contains information about the quality of the sequence
- Quality comes from the sequencing machines!
- Four lines per sequence:
  1. Line starting with @ = identifier line before the sequence
  2. DNA sequence
  3. Line starting with + = identifier line before the quality scores
  4. Quality string = one ASCII character per base, where character code = quality score + 33

fastq
@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA
AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA
CCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:E
A1EA1EA5 9B:?:#9EA0D@2EA5':>5?:%A;A8
A;?9B;D@/=<?7=9<2A8==

Base calling
- Need to be sure which base you have identified
- Depends on the technology
- Each machine includes software
- Phred is a historical base-calling package developed at U. Washington
- Phred scores encode the probability that the base call is wrong (the higher the score, the more reliable the base)

Quality values
- Phred 10: $1 \times 10^{-1}$ chance that the base is wrong
- Phred 20: $1 \times 10^{-2}$ chance that the base is wrong
- Phred 30: $1 \times 10^{-3}$ chance that the base is wrong
- Phred 40: $1 \times 10^{-4}$ chance that the base is wrong
- Phred 99: the base is correct!
- Fastq scores are the score + 33, then converted to ASCII text
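The relationship on this slide is $P_{wrong} = 10^{-Q/10}$, and the fastq encoding is simply chr(Q + 33). A small Python sketch of both conversions follows; the function names are our own, and the offset of 33 is the one stated on the slide.

```python
def phred_to_error_probability(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_fastq_quality(quality_string, offset=33):
    """Turn a fastq quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def encode_fastq_quality(scores, offset=33):
    """Turn a list of Phred scores into a fastq quality string."""
    return "".join(chr(q + offset) for q in scores)


# First five quality characters of the SRR014849.1 example read.
first_scores = decode_fastq_quality("3+&$#")
print(first_scores)  # [18, 10, 5, 3, 2]
print([phred_to_error_probability(q) for q in (10, 20, 30, 40)])
```

Decoding the start of the example read gives low scores for the run of Gs, which fits the later point that homopolymers tend to score poorly.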

ASCII character codes
ASCII Char   ASCII Char   ASCII Char   ASCII Char   ASCII Char
33    !      50    2      70    F      90    Z      110   n
34    "      51    3      71    G      91    [      111   o
35    #      52    4      72    H      92    \      112   p
36    $      53    5      73    I      93    ]      113   q
37    %      54    6      74    J      94    ^      114   r
38    &      55    7      75    K      95    _      115   s
39    '      56    8      76    L      96    `      116   t
40    (      57    9      77    M      97    a      117   u
41    )      58    :      78    N      98    b      118   v
42    *      59    ;      79    O      99    c      119   w
43    +      60    <      80    P      100   d      120   x
44    ,      61    =      81    Q      101   e      121   y
45    -      62    >      82    R      102   f      122   z
46    .      63    ?      83    S      103   g      123   {
47    /      64    @      84    T      104   h      124   |
48    0      65    A      85    U      105   i      125   }
49    1      66    B      86    V      106   j      126   ~

fastq
@SRR014849.1 EIXKN4201CFU84 length=93
[DNA sequence]
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA
AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA
CCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
[Quality scores]
3+&$#"""""""""""7F@71,'";C?,B;?6B;:E
A1EA1EA5 9B:?:#9EA0D@2EA5':>5?:%A;A8
A;?9B;D@/=<?7=9<2A8==
Note: Illumina used to have a fastq format that was not compatible with everyone else's format!

Ion quality scores PRINSEQ Rob Schmieder

prinseq

Basic data analysis
New dataset -> Assemble data -> Perform similarity search

Bad data analysis

Good data analysis
New dataset -> Quality control & Preprocessing -> Assembly -> Similarity search

3 tools for metagenomic data
- http://prinseq.sourceforge.net
- http://tagcleaner.sourceforge.net
- http://deconseq.sourceforge.net

Quality control and data preprocessing

Number and Length of Sequences

Number/Length of sequences
[Figure panels: Bad, Good]
- Reads should be approximately the same length (same number of cycles)
-> Short reads are likely lower quality

Quality of Sequences

Linearly degrading quality across the read -> trim low quality ends

Quality filtering
- Any region with a homopolymer will tend to have a lower quality score
- Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
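A filter on the average score could look like the sketch below. It assumes the fastq encoding described earlier (offset 33) and takes a plain arithmetic mean of the Phred scores; the threshold of 25 comes from the slide, while the function names are ours and this is not how any particular tool implements it.

```python
def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the Phred scores in a fastq quality string."""
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_mean_quality(quality_string, min_mean=25, offset=33):
    """True if the read's average score is at least min_mean."""
    return mean_quality(quality_string, offset) >= min_mean


print(mean_quality("IIII"))         # 40.0
print(passes_mean_quality("####"))  # False
```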

Low quality sequence issue
- Most assemblers or aligners do not take quality scores into account
- Errors in reads complicate assembly, might cause misassembly, or make assembly impossible

What if quality scores are not available?
- Alternative: infer quality from the percent of Ns found in the sequence
- Remove regions with a high number of Ns
- Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
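The percent of Ns is straightforward to compute. A minimal Python sketch follows; the function names and the example cutoff are illustrative assumptions, not values from the slides.

```python
def percent_n(sequence):
    """Percentage of positions in the sequence that are N (case-insensitive)."""
    if not sequence:
        return 0.0
    return 100.0 * sequence.upper().count("N") / len(sequence)


def keep_read(sequence, max_percent_n=1.0):
    """True if the read has no more than max_percent_n percent Ns (cutoff is an example)."""
    return percent_n(sequence) <= max_percent_n


print(percent_n("ACGTNNNN"))  # 50.0
print(keep_read("ACGTACGT"))  # True
```

Setting `max_percent_n=0` gives the strict variant of dropping every read that contains an N.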

Ambiguous bases
- If you can afford the loss, filter out all reads containing Ns
- Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, ...) use a 2-bit encoding system for nucleotides
- Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
- 2-bit example: 00 = A, 01 = C, 10 = G, 11 = T
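The 2-bit example shows why N is a problem: four codes leave no room for a fifth symbol. The Python sketch below packs a sequence using the mapping from the slide and replaces N with a fixed base; the function names are ours, and real tools pack into fixed-width machine words rather than one Python integer.

```python
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack_2bit(sequence, n_replacement="A"):
    """Pack a DNA string into an integer, 2 bits per base; N becomes n_replacement."""
    value = 0
    for base in sequence.upper():
        if base == "N":
            base = n_replacement  # information about the ambiguity is lost here
        value = (value << 2) | TWO_BIT[base]
    return value


def unpack_2bit(value, length):
    """Recover the DNA string of the given length from its packed integer."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BASES[(value >> shift) & 0b11])
    return "".join(bases)


packed = pack_2bit("ACGT")
print(bin(packed))                        # 0b11011
print(unpack_2bit(pack_2bit("ANGT"), 4))  # AAGT
```

The round trip of "ANGT" comes back as "AAGT": once packed, the N cannot be told apart from a real A.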

Tag Sequences

o tag ID tag TA tags

Detect and remove tag sequences

Data upload Tag sequence definition

Tag sequence prediction

Parameter definition Download results

Sequence Contamination

Principal component analysis (PCA) of dinucleotide relative abundance Microbial metagenomes Viral metagenomes

Identification and removal of sequence contamination

Contaminant identification
- Current methods have critical limitations
- Dinucleotide relative abundance uses information content in sequences -> cannot identify single contaminant sequences
- Sequence similarity seems to be the only reliable option to identify single contaminant sequences
- BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, ...)
- Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

DeconSeq web interface
- Two types of reference databases: Remove and Retain

DeconSeq web interface (cont.)

Human DNA contamination identified in 145 out of 202 metagenomes

Contamination Identification

http://edwards.sdsu.edu/genomepeek

16S

recA

rpoB

groEL

Using prinseq to check the quality of the data

Using prinseq
1. In the folder, go to /home/qiime/documents/ecoli
2. Right click on CIA49E_coli_DH10B_Control_200.zip
   Choose Extract Here
3. Right click on R_2014_02_04_00_54_34_user_CIA-49-Ion_PGM_E_coli_DH10B_Control_200
   Choose open terminal here
4. Run this command:
   prinseq-lite.pl -fastq Ecoli.fastq -graph_stats ld,gc,qd,ns,ts -graph_data ~/Desktop/DH10B.gd
5. Open Firefox: go to http://edwards.sdsu.edu/prinseq
   Choose Get Report

After checking the quality:
1. Trim the sequences to remove low quality (<15) at the right end
2. Filter the sequences to remove sequences shorter than 200 bp
3. Create a new fastq file
For the list of options:
prinseq-lite.pl -h

Trimming
prinseq-lite.pl -fastq ecoli.fastq -min_len 200 -trim_qual_right 15 -out_good DH10B.trimmed
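To make the two steps concrete, here is a Python sketch of the idea behind the command: drop bases from the right end while their score is below 15, then discard reads shorter than 200 bp. It is a simplified illustration under the offset-33 encoding, not a reimplementation of prinseq-lite.pl, whose trimming has additional options; the function names are ours.

```python
def trim_right_by_quality(sequence, quality, min_quality=15, offset=33):
    """Remove bases from the right end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def trim_and_filter(sequence, quality, min_quality=15, min_len=200, offset=33):
    """Trim the right end, then return None if the read is shorter than min_len."""
    sequence, quality = trim_right_by_quality(sequence, quality, min_quality, offset)
    if len(sequence) < min_len:
        return None
    return sequence, quality


# Toy read: four good bases (score 40) followed by two poor ones (score 2).
print(trim_right_by_quality("ACGTAC", "IIII##"))      # ('ACGT', 'IIII')
print(trim_and_filter("ACGTAC", "IIII##", min_len=5))  # None
```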

Convert fastq to fasta
On the virtual box:
- fastq2qual_fasta.pl
- fastq_to_fasta
- prinseq-lite.pl
On the web:
- http://edwards.sdsu.edu/cgi-bin/fastq2fasta.cgi

Using prinseq to convert fastq to fasta
prinseq-lite.pl -h
prinseq-lite.pl -fastq Ecoli_trimmed.fastq -out_format 2 -out_good Ecoli_trimmed

Sequence Assembly

Assembling the data
- Problem: the longest single sequence read possible is about 1,000 bp, and most technologies produce 50-500 bp
- Microbial genomes are about 2,000,000 bp
- Therefore, how do you sequence a whole genome?

Sequencing the genomes
- Extract DNA
- Shear DNA into small pieces
- Ligate adapters on each end
- Sequence using next generation sequencing

Sequence assembly
- Before we look at the data
- Can we make longer pieces?

The assembly
- A hierarchical data structure that maps sequence data to a reconstruction of the target
- The assembly groups
  - reads into contigs
  - contigs into scaffolds
- Contigs provide
  - a multiple sequence alignment of reads
  - a consensus sequence
- Scaffolds provide
  - contig order and orientation
  - sizes of the gaps between contigs

Sequence assembly
Reads -> Contigs -> Scaffolds

Four approaches to assembly
- Naïve approach
- Greedy approach
- Overlap / Layout / Consensus
- de Bruijn graphs

Naïve approach
- Compare every sequence to every other sequence
- Find stretches that are the same
- Need to account for phred scores: what if a base is wrong?
- How long does a sequence need to be to be unique?

Sequence composition
- 4 bases
- $4^n$ possible sequences of length n, so a $1/4^n$ chance of finding a given sequence if all bases are evenly used (they are not)
- 3 bp: $4^3$ = 64
- 8 bp: $4^8$ = 65,536
- 20 bp: $4^{20}$ = 1,099,511,627,776
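The numbers on this slide, and what they mean for a 2,000,000 bp genome, can be checked with a few lines of Python. The sketch assumes uniformly random, independent bases, which the slide itself points out is not true of real genomes; the function names are ours.

```python
def possible_sequences(n):
    """Number of distinct DNA sequences of length n."""
    return 4 ** n


def expected_occurrences(n, genome_length):
    """Expected number of times one given n-mer occurs in a random genome (one strand)."""
    positions = genome_length - n + 1
    if positions <= 0:
        return 0.0
    return positions / possible_sequences(n)


for n in (3, 8, 20):
    print(n, f"{possible_sequences(n):,}", expected_occurrences(n, 2_000_000))
```

Under these assumptions an 8-mer is expected about 30 times in a 2 Mb genome, while a given 20-mer is expected far less than once, which is why unique matches need reasonably long stretches.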

Problems with this approach
- Sequences are not random
- Most genomes contain biased information
- Repeat sequences in the genome

Greedy approaches
- Start with a sequence
- Keep extending it while another sequence matches the end
- When it cannot be extended further, mark it as a contig
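The three steps above can be sketched in Python as follows. This is a toy version under strong assumptions: exact matches only, extension to the right only, no reverse complements and no quality scores. The function names and the `min_overlap` default are ours and do not correspond to any particular assembler.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of left that equals a prefix of right, or 0."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if length > 0 and left.endswith(right[:length]):
            return length
    return 0


def greedy_extend(seed, reads, min_overlap=3):
    """Extend seed to the right with the best-overlapping unused read until none fits."""
    contig = seed
    unused = list(reads)
    while True:
        best_read, best_len = None, 0
        for read in unused:
            length = overlap_length(contig, read, min_overlap)
            # A read must add at least one new base to count as an extension.
            if length > best_len and length < len(read):
                best_read, best_len = read, length
        if best_read is None:
            return contig, unused  # cannot be extended further: this is a contig
        contig += best_read[best_len:]
        unused.remove(best_read)


contig, leftover = greedy_extend("AACCGGT", ["CCGGTTA", "GGGGGGG"])
print(contig, leftover)  # AACCGGTTA ['GGGGGGG']
```

Always taking the longest overlap is what makes the method greedy: it is fast, but a repeat can send the extension down the wrong path.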

Improving greedy approaches
- Only use high quality sequence
- Use reads that are represented more than n times in the sample (SSAKE)
- End-to-end overlap vs. partial overlap
- Ignore low coverage regions and also incorporate quality scores (SHARCGS)
- In general, greedy approaches are fast but not very good: they make lots of short contigs

Overlap / Layout / Consensus
- All-versus-all comparison (done with k-mers for speed)
- Generate an approximate read layout as an overlap graph
- Use multiple sequence alignments to resolve the layout

Newbler (O/L/C)
- Makes unitigs: single contigs with no discrepancies
- Merges unitigs into contigs
- May split unitigs and even reads (could be chimeras)
- Uses coverage to compensate for base calls
- Works in flow space to calculate homopolymeric tracts; more accurate than average of averages

Assembly is a graph problem
- Overlap/Layout/Consensus
- de Bruijn graph
- Greedy graphs
A graph is nodes + edges
[Figure labels: node, edge]

Assemble these two sequences!
AACCGGT
  CCGGTTA
Consensus: AACCGGTTA

AACCGGT as graphs
Node = K-mers; edges = nodes that overlap by K-1 bases.
aacc -> accg -> ccgg -> cggt
Here K = 4, but in reality K = 19 to 31

CCGGTTA as graphs
ccgg -> cggt -> ggtt -> gtta

Join the two graphs
                ccgg -> cggt -> ggtt -> gtta
aacc -> accg -> ccgg -> cggt
Joined: aacc -> accg -> ccgg -> cggt -> ggtt -> gtta
AACCGGTTA
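The joining step can be reproduced with a small Python sketch that follows the convention on these slides (nodes are K-mers, edges link K-mers that follow each other in a read and so overlap by K-1 bases). It handles only the simple case of a single unbranched path; the function names are ours.

```python
def kmers(sequence, k):
    """All k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_graph(reads, k):
    """Map each k-mer node to the set of k-mers that directly follow it in some read."""
    graph = {}
    for read in reads:
        nodes = kmers(read.lower(), k)
        for node in nodes:
            graph.setdefault(node, set())
        for left, right in zip(nodes, nodes[1:]):
            graph[left].add(right)  # shared k-mers merge the two reads automatically
    return graph


def walk_unbranched_path(graph):
    """Spell the sequence along a graph that forms one simple path."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("graph is not a single linear path")
    node = starts[0]
    sequence, seen = node, {node}
    while graph[node]:
        if len(graph[node]) != 1:
            raise ValueError("branch found (e.g. a repeat); cannot resolve here")
        node = next(iter(graph[node]))
        if node in seen:
            raise ValueError("cycle found (e.g. a repeat); cannot resolve here")
        seen.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


graph = build_kmer_graph(["AACCGGT", "CCGGTTA"], k=4)
print(sorted(graph))
consensus = walk_unbranched_path(graph)
print(consensus.upper())  # AACCGGTTA
```

The errors raised on branches and cycles are exactly the situations that repeats create in real data.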

Differences between an overlap graph and a de Bruijn graph for assembly.
Schatz MC et al. Genome Res. 2010;20:1165-1173. © 2010 by Cold Spring Harbor Laboratory Press

Problems with all assemblies
- Sequences are not random
- Most genomes contain biased information
- Repeat sequences in the genome

Repeats

Repeats have multiple sinks/sources
- 16S: Salmonella has 7 rrn operons
- Salmonella recombines at rrn operons (Helm and Maloy)

Repeat sequences
- What happens if the repeat is longer than the read length?
- Need paired end reads to resolve order
- Need pairs that span the repeat
- Need pairs with one end in the repeat

Mate pair and paired end sequencing

Mate pair Sequencing (Ion Torrent) Add linkers

Mate pair sequencing (Ion Torrent) Nick Sequencing migration

Paired End Sequencing (Illumina) Left read DNA fragment Right read

Paired End Sequencing (Illumina) Number of fragments Fragment length

Paired End Sequencing (Illumina)

Joining paired end sequences
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]
Citation - PEAR: a fast and accurate Illumina Paired-End read merger
Zhang et al (2014) Bioinformatics 30(5): 614-620
doi:10.1093/bioinformatics/btt593
License: Creative Commons Licence
Bug-reports and requests to: Tomas.Flouri@h-its.org and Jiajie.Zhang@h-its.org
Usage: pear <options>
Standard (mandatory):
  -f, --forward-fastq <str>  Forward paired-end FASTQ file.
  -r, --reverse-fastq <str>  Reverse paired-end FASTQ file.
  -o, --output <str>         Output filename.
http://sco.h-its.org/exelixis/web/software/pear/

Repeats A B C Paired end reads or mate pairs


Current assemblers
- AMOS
- Celera WGA Assembler
- CLC Genomics Workbench
- DNA Dragon
- DNAnexus
- Euler
- Geneious
- IDBA (Iterative De Bruijn graph short read Assembler)
- LIGR Assembler (derived from TIGR Assembler)
- MIRA (Mimicking Intelligent Read Assembly)
- Newbler
- Phrap
- SSAKE
- SOAPdenovo
- SPAdes
- Velvet

Assembly
1. De novo assembly with Newbler (runassembly)
2. Map to the genome with Newbler (runmapping)
3. De novo assembly with SPAdes (spades.py)

De novo assembly with Newbler
1. Convert fastq to fasta
2. Use this command: runassembly <fasta file>
   e.g. (all on one line)
   runassembly -noinfo -nobig -noace Ecoli.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna

basic_stats: assembly
------------- BASIC FASTA STATISTICS ---------------
Total number of sequences: 1882
Total number of bases: 4,266,422 bp (4.27 Mb)
Average sequence length: 2266.96
Minimum sequence length: 100 bp
Maximum sequence length: 23892 bp
N50: 3491 bp
50% of total sequence length is contained in 369 sequences
GC %: 50.79 %
-----------------------------------------------------

basic_stats: mapping
------------- BASIC FASTA STATISTICS ---------------
Total number of sequences: 654
Total number of bases: 4,339,931 bp (4.34 Mb)
Average sequence length: 6635.98
Minimum sequence length: 160 bp
Maximum sequence length: 49317 bp
N50: 12274 bp
50% of total sequence length is contained in 109 sequences
GC %: 50.76 %

N50

N50 Length = N50
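N50 is the length of the sequence at which, going from longest to shortest, the running total first reaches 50% of all bases (the basic_stats line "50% of total sequence length is contained in ... sequences" counts how many sequences that takes). A minimal Python sketch follows; the function name and the toy contig lengths are ours.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Toy contig lengths: total 54, so half is 27; 10 + 9 + 8 reaches 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

A higher N50 means more of the assembly sits in long contigs, which is why the mapping result above (12274 bp) looks better than the de novo one (3491 bp).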

Mapping to a genome
1. Convert the E. coli GenBank file to fasta format
   GB2Fasta.pl CP000948_DH10B.gbk CP000948_DH10B.fasta
2. Map the reads to the genome
   runmapping -cpu 2 -noinfo -noace -nobig -gref CP000948_DH10B.fasta DH10B.trimmed.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna

SPAdes 3.7.1
1. Run this command (all on one line):
   /opt/spades-3.7.1-linux/bin/spades.py --iontorrent -k 21,33,55,77,99,127 --mismatch-correction -t 2 -s ecoli.fastq -o ecoli.spades

Hybrid assembly Geni Silva

scaffold_builder
http://edwards.sdsu.edu/scaffold_builder
Silva et al. Source Code for Biology and Medicine 2013, 8:23

How do we know if the assembly is good?

QUAST
1. Run this command (all on one line):
   quast -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
   or, calling the script by its full path:
   python /opt/quast_3.2/quast.py -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
2. When the command finishes:
   gnome-open Ecoli.quast/report.html
   (also open Ecoli.quast/alignment.svg and Ecoli.quast/report.pdf)

Mauve
- Locally co-linear blocks (LCBs)
- Homologous regions shared by any two genomes
- User-specified cutoff for what counts as homologous

Multi-mums
- Multiple unique matches
- Identify k-mers that:
  - are shared by 2 or more genomes
  - appear only once per genome
  - are bounded by mismatched bases (cannot be extended further)
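The first two criteria can be illustrated with a short Python sketch that finds exact k-mers occurring once in every genome. It is a simplification of the idea on this slide, not Mauve's implementation: it requires the k-mer in all genomes rather than any 2 or more, looks at the forward strand only, and leaves out the extension until a mismatch. The function names and toy genomes are ours.

```python
from collections import Counter


def unique_kmer_positions(genome, k):
    """Map each k-mer that occurs exactly once in the genome to its start position."""
    all_kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(all_kmers)
    return {kmer: i for i, kmer in enumerate(all_kmers) if counts[kmer] == 1}


def shared_unique_kmers(genomes, k):
    """K-mers unique within every genome, with their start position in each genome."""
    per_genome = [unique_kmer_positions(genome, k) for genome in genomes]
    shared = set(per_genome[0])
    for positions in per_genome[1:]:
        shared &= set(positions)
    return {kmer: [positions[kmer] for positions in per_genome]
            for kmer in sorted(shared)}


seeds = shared_unique_kmers(["ACGTTGCA", "TTACGTAA"], k=4)
print(seeds)  # {'ACGT': [0, 2]}
```

Each result records the start of the match in every genome, which is the information the later slides use to order matches into blocks.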

Mauve
1. Find all local multimums
2. Calculate phylogenetic guide tree
3. Select a subset of multimums to create LCBs
4. Anchor regions not covered with multimums
5. Align LCBs

How to construct...
- Construct a sorted list of k-mers
- Find the set that matches the occurrence criteria:
  - once per genome
  - >1 genome
- Extend until mismatch

Constructing sorted list
Hash using hash function:
$S_1$ = start of multimum M in first genome
$S_j$ = start of multimum M in genome j

Join all mums
Join all mums together where:
$M_i.S_j \le M_{i+1}.S_j$
For any genome j, the start of multimum i is less than or equal to the start of the next multimum.
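The condition $M_i.S_j \le M_{i+1}.S_j$ can be turned into a simple grouping rule, sketched below in Python. Each multimum is represented only by its tuple of start positions $(S_1, ..., S_G)$; matches are sorted by their start in the first genome and a new block begins whenever the ordering breaks in any genome. This is a deliberately naive stand-in for how locally co-linear blocks are formed, not Mauve's actual algorithm, and the function name and coordinates are invented.

```python
def partition_collinear(multimums):
    """Group multimums (tuples of per-genome starts) into runs ordered in every genome."""
    blocks = []
    for starts in sorted(multimums):  # sorted by S_1, then S_2, ...
        previous = blocks[-1][-1] if blocks else None
        if previous is not None and all(p <= s for p, s in zip(previous, starts)):
            blocks[-1].append(starts)   # order preserved in every genome j
        else:
            blocks.append([starts])     # order broken: start a new block
    return blocks


# Invented start positions in two genomes; the last two matches are rearranged in genome 2.
blocks = partition_collinear([(0, 10), (5, 15), (9, 2), (12, 6)])
print(blocks)  # [[(0, 10), (5, 15)], [(9, 2), (12, 6)]]
```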

Start with all matches

Partition into locally co-linear blocks

Remove low scoring LCBs (3 * length of k-mer)

Reduce k-mer size, and repeat for unaligned regions

Mauve Speed
- Aligned many prokaryote genomes
- Human, mouse, and rat genomes

Problems - Repeats
For a repetitive element appearing r times in G genomes:
- $r^G$ possible combinations
- r of them are correct!

Mauve
From the Apps menu on the top left, choose Mauve
File -> Align with ProgressiveMauve