File Formats
Sequence File Formats
- Different formats for different uses
- Competing formats developed in parallel
- Some are easy to read, some are easy to write programs for
- You don't have to stick to these formats, but parsers are already written!
- All formats are plain text
Standard nucleotide codes (IUPAC)
Symbol  Meaning            Origin
G       G                  Guanine
A       A                  Adenine
C       C                  Cytosine
T       T                  Thymine
R       G or A             puRine
Y       T or C             pYrimidine
M       A or C             aMino
K       G or T             Keto
N       G or A or T or C   aNy
Standard protein codes
One  Three  Amino acid       One  Three  Amino acid
A    Ala    Alanine          M    Met    Methionine
C    Cys    Cysteine         N    Asn    Asparagine
D    Asp    Aspartic acid    P    Pro    Proline
E    Glu    Glutamic acid    Q    Gln    Glutamine
F    Phe    Phenylalanine    R    Arg    Arginine
G    Gly    Glycine          S    Ser    Serine
H    His    Histidine        T    Thr    Threonine
I    Ile    Isoleucine       V    Val    Valine
K    Lys    Lysine           W    Trp    Tryptophan
L    Leu    Leucine          Y    Tyr    Tyrosine
                             X    Xaa    Unknown
GenBank
- More complex; includes detailed information on genes, CDS, annotation, etc.
- Human readable
- Difficult to parse
- Use standard parsers (BioPerl, BioJava, etc.)
LOCUS       NC_001418               5833 bp ss-DNA     circular PHG 17-APR-2009
DEFINITION  Pseudomonas phage Pf3, complete genome.
ACCESSION   NC_001418
VERSION     NC_001418.1  GI:9626316
DBLINK      Project:14061
KEYWORDS    .
SOURCE      Pseudomonas phage Pf3
  ORGANISM  Pseudomonas phage Pf3
            Viruses; ssDNA viruses; Inoviridae; Inovirus.
FEATURES             Location/Qualifiers
     source          1..5833
                     /organism="Pseudomonas phage Pf3"
                     /mol_type="genomic DNA"
                     /host="Pseudomonas aeruginosa"
                     /db_xref="taxon:10872"
                     /note="Pf3 bacteriophage DNA from P.aeruginosa infected
                     with plasmid RP1."
     gene            join(5763..5833,1..106)
                     /locus_tag="Pf3_1"
                     /db_xref="GeneID:1260905"
     CDS             join(5763..5833,1..106)
                     /locus_tag="Pf3_1"
                     /note="ORF 58, part 2"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="NP_040651.1"
                     /db_xref="GI:9626317"
                     /db_xref="GeneID:1260905"
                     /translation="MSYYVCVQLVNDVCHEWAERSDLLSLPEGSGLQIGGMLLLLSAT
                     AWGIQQIARLLLNR"
     3241 aggtcctgtt ggccttaaga tcacccaagg gcatcttgcc agatggtacc gtcattactt
     3301 atgagaaaat atcctcaatg ggtaatggct ataccttcga gcttgagtcg cttatatttg
     3361 cggctcttgc tcggtcttta tgcgaattac tgggcttacg accgtcagat gttacggtct
     3421 atggcgatga cataatattg ccatcagacg cgtgcagtcc tctagttgaa gttttctcct
     3481 atgttggttt tcgtaccaac aagaagaaaa cgttttctag tggaccgttc cgagagtcgt
     3541 gcggaaagca ctactttttg ggcgttgacg tcacaccttt ctacatacgt cgccgtatag
     3601 tgagtccctc cgatctcata ctggttttga accagatgta tcgttgggcc acaattgacg
     3661 gcgtatggga tcctagggta tatcctgtat acaccaagta tagacgttac cttccggaaa
     3721 ttctccggag gaatgtcgtg cctgatggat acggtgatgg tgccctcgtc ggatctgtct
     3781 taatcagtcc tttcgcagaa aatcgcggtt gggttcggcg tgtgccgatg attatagaca
     3841 agaggaaaga ccgagttcgt gacgaatatg gttcgtatct ctacgagcta tggtcgttgc
     3901 agcaactcga atgtgacagt gagttcccct ttaacgggtc gctggtcgtt ggttccactg
     3961 atggcactct cgcttacgca caccgagaac ggttacctac cgttatcagt gatgccgtaa
     4021 gtgcgtttga catcatgtgg ataccgtgca gtagtcgtgt cctggctccc tacggggatt
     4081 tccggaggca cgaaggctct atcctaaaaa tggggtagcg cctgggaggg gtgcattatg
     4141 caccctaggt tagcaatact taaactaacc ttctcaaaag agagagtgaa ggctctgctt
     4201 tgccctcact cctccca
//
LOCUS       NC_003301               3192 bp ds-RNA     linear   PHG 23-AUG-2008
DEFINITION  Pseudomonas phage phi8 segment S, complete sequence.
ACCESSION   NC_003301
VERSION     NC_003301.1  GI:17736965
DBLINK      Project:14731
KEYWORDS    .
SOURCE      Pseudomonas phage phi8
  ORGANISM  Pseudomonas phage phi8
            Viruses; dsRNA viruses; Cystoviridae; Cystovirus.
GFF3
- Tab-separated format
- Easy to parse
- Attributes are tag/value pairs separated by ;
- Columns:
  1. Contig
  2. Source database
  3. Feature type
  4. Start
  5. Stop
  6. Score
  7. Strand
  8. Phase
  9. Attributes
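As a rough illustration of why GFF3 is easy to parse, here is a minimal Python sketch that splits one feature line into the nine columns listed above. The column names follow the GFF3 convention (seqid, source, type, start, end, score, strand, phase, attributes); the example line is hypothetical and only loosely modelled on the Pf3 record, and the function name `parse_gff3_line` is an assumption made for this sketch.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds tag=value pairs separated by semicolons.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)  # values may be percent-encoded
    record["attributes"] = attributes
    return record


# Hypothetical feature line, made up for illustration.
example = "NC_001418\tRefSeq\tCDS\t1\t106\t.\t+\t0\tID=cds0;locus_tag=Pf3_1"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"],
      feature["attributes"]["locus_tag"])
```

For real files an established parser is still the safer choice; this only shows the shape of the format.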
ASN.1
- Developed as a computer-readable form of GenBank
- Not widely used
ASN.1 seq { id { local id 1 }, descr { title "" }, inst { repr raw, mol aa, length 131, topology linear, { seq-data iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQA TGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFA PTLMSSCITSTTGPPAWAGDRSHE" } }, seq { id { local id 1 }, descr { title "" }, inst { repr raw, mol aa, length 131, topology linear, { seq-data iupacaa "TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT GGWKTCSGTCTTSTSTRHRGRSGW----------RASRKSMRAACSRSAGSRPNRFAPTL MSSCITSTTGPPAWAGDRSHE" } }
Fasta
- Also called Pearson format
- Simplest file format
- Easy to parse, easy to use

>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactacgctagcatcagtcat
CAT
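To show how little is needed to read this format, the following Python sketch parses FASTA lines into (identifier, description, sequence) tuples. The function names `read_fasta` and `_make_record` are assumptions for this example; it joins wrapped sequence lines and keeps the case of the bases unchanged.

```python
def _make_record(header, chunks):
    """Split the header at the first space and join the wrapped sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def read_fasta(lines):
    """Yield (identifier, description, sequence) from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


example = """\
>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactacgctagcatcagtcat
CAT
""".splitlines()

records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```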
Qual
>identifier [optional information]
35 35 35 35 22 35 35 35 31 35 31 35 34 35 35 31 35 36 35 37 36 36 36 22 35 34 31 35 35 35 35 25
>identifier2 [optional information]
35 36 37 31 35 35 28 28 28 35 34 34 33 32 34 34 33 29 34 36 15 28 27 27 27 27 21 35 35 35 33 33
fastq
- Based on fasta format
- Contains information about the quality of the sequence
- Quality comes from the sequencing machines!
- Four lines per sequence:
  1. Line starting with @ = identifier line before the sequence
  2. DNA sequence
  3. Line starting with + = identifier line before the quality scores
  4. Quality string = one ASCII character per base, encoding (quality score + 33)
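The four-line layout can be read with a few lines of Python. This is a sketch under the assumption that every record is exactly four lines with no wrapped sequence lines and no blank lines; the reads below are hypothetical, and `read_fastq` is a name made up for this example. Counting lines matters because a quality string may itself begin with @.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines.

    Assumes exactly four lines per record.
    """
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Hypothetical two-read example, not taken from the slides.
example = [
    "@read1 length=8",
    "ACGTACGT",
    "+read1 length=8",
    "IIIIHH#5",
    "@read2 length=4",
    "GGCA",
    "+",
    "@@@!",  # a quality line may start with '@', so records are counted, not searched
]
records = list(read_fastq(example))
print(records)
```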
fastq @SRR014849.1 EIXKN4201CFU84 length=93 GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA CCTTCCAAAGCAATGCCAATA +SRR014849.1 EIXKN4201CFU84 length=93 3+&$#"""""""""""7F@71,'";C?,B;?6B;:E A1EA1EA5 9B:?:#9EA0D@2EA5':>5?:%A;A8 A;?9B;D@/=<?7=9<2A8==
Base calling
- Need to be sure which base you have identified
- Depends on the technology
- Each machine includes its own base-calling software
- Phred is a historical base-calling package developed at U. Washington
- Phred scores encode the probability that a base call is wrong
Quality values
- Phred 10: $1 \times 10^{-1}$ chance that the base is wrong
- Phred 20: $1 \times 10^{-2}$ chance that the base is wrong
- Phred 30: $1 \times 10^{-3}$ chance that the base is wrong
- Phred 40: $1 \times 10^{-4}$ chance that the base is wrong
- Phred 99: the base is correct!
- Fastq scores are the score + 33, then converted to ASCII text
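The relationship between a Phred score, its error probability, and the fastq character can be sketched in Python as below. The function names are assumptions for this example; the offset of 33 is the one stated on the slide.

```python
def phred_to_error_probability(score):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


def decode_quality_string(quality, offset=33):
    """Turn a fastq quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


def encode_quality_scores(scores, offset=33):
    """Turn a list of Phred scores into a fastq quality string."""
    return "".join(chr(score + offset) for score in scores)


# The first five quality characters of the SRR014849.1 read.
scores = decode_quality_string('3+&$#')
print(scores)
print([phred_to_error_probability(score) for score in (10, 20, 30, 40)])
print(encode_quality_scores([35, 35, 22]))
```

Decoding and then encoding again returns the original string, which is a quick sanity check that the offset is right for a given file.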
ASCII character codes
ASCII Char   ASCII Char   ASCII Char   ASCII Char   ASCII Char
33    !      50    2      70    F      90    Z      110   n
34    "      51    3      71    G      91    [      111   o
35    #      52    4      72    H      92    \      112   p
36    $      53    5      73    I      93    ]      113   q
37    %      54    6      74    J      94    ^      114   r
38    &      55    7      75    K      95    _      115   s
39    '      56    8      76    L      96    `      116   t
40    (      57    9      77    M      97    a      117   u
41    )      58    :      78    N      98    b      118   v
42    *      59    ;      79    O      99    c      119   w
43    +      60    <      80    P      100   d      120   x
44    ,      61    =      81    Q      101   e      121   y
45    -      62    >      82    R      102   f      122   z
46    .      63    ?      83    S      103   g      123   {
47    /      64    @      84    T      104   h      124   |
48    0      65    A      85    U      105   i      125   }
49    1      66    B      86    V      106   j      126   ~
fastq
@SRR014849.1 EIXKN4201CFU84 length=93
DNA sequence:
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA CCTTCCAAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
Quality scores:
3+&$#"""""""""""7F@71,'";C?,B;?6B;:E A1EA1EA5 9B:?:#9EA0D@2EA5':>5?:%A;A8 A;?9B;D@/=<?7=9<2A8==
Note: Illumina used to have a format of fastq that was not compatible with everyone else's format!
Ion quality scores PRINSEQ Rob Schmieder
prinseq
Basic data analysis New dataset Assemble data Perform similarity search
Bad data analysis
New dataset Good data analysis
Good data analysis New dataset Quality control & Preprocessing
Good data analysis New dataset Quality control & Preprocessing Assembly Similarity search
3 Tools for metagenomic data http://prinseq.sourceforge.net http://tagcleaner.sourceforge.net http://deconseq.sourceforge.net
Quality control and data preprocessing
Number and Length of Sequences
Number/Length of sequences
[Figure: read length distributions, panels "Bad" and "Good"]
- Reads should be approximately the same length (same number of cycles) -> short reads are likely lower quality
Quality of Sequences
Linearly degrading quality across the read -> trim low quality ends
Quality filtering
- Any region with a homopolymer will tend to have a lower quality score
- Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
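A filter on average quality can be sketched in a few lines of Python. The cutoff of 25 comes from the Huse et al. observation above; the function names and the Phred + 33 encoding are assumptions for this example, and this is not how any particular tool implements its filter.

```python
def mean_quality(quality, offset=33):
    """Average Phred score of a fastq quality string (assumes Phred + 33 encoding)."""
    scores = [ord(character) - offset for character in quality]
    return sum(scores) / len(scores)


def passes_mean_quality(quality, minimum=25, offset=33):
    """Keep a read only if its average quality reaches the cutoff."""
    return mean_quality(quality, offset) >= minimum


# Hypothetical quality strings: 'I' is Phred 40, '5' is Phred 20.
print(mean_quality("IIII"), passes_mean_quality("IIII"))
print(mean_quality("5555"), passes_mean_quality("5555"))
```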
Low quality sequence issue
- Most assemblers and aligners do not take quality scores into account
- Errors in reads complicate assembly, might cause misassembly, or make assembly impossible
What if quality scores are not available?
- Alternative: infer quality from the percent of Ns found in the sequence
- Remove regions with a high number of Ns
- Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
Ambiguous bases
- If you can afford the loss, filter out all reads containing Ns
- Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, ...) use a 2-bit encoding system for nucleotides
- Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
- 2-bit example: 00 = A, 01 = C, 10 = G, 11 = T
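The 2-bit encoding explains why an N cannot be stored: all four bit patterns are already taken. Below is a minimal Python sketch of that idea using the mapping from the slide; replacing N with a fixed A mirrors the behaviour described above, while the function names `pack` and `unpack` are assumptions for this example.

```python
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(sequence, n_replacement="A"):
    """Pack a DNA string into an integer, two bits per base; N becomes a fixed base."""
    value = 0
    for base in sequence.upper():
        if base == "N":
            base = n_replacement  # no bit pattern is left over for N
        value = (value << 2) | TWO_BIT[base]
    return value


def unpack(value, length):
    """Recover the DNA string of the given length from its packed integer."""
    return "".join(
        BASES[(value >> (2 * (length - 1 - index))) & 0b11]
        for index in range(length)
    )


print(pack("ACGT"), unpack(pack("ACGT"), 4))
print(unpack(pack("ANGT"), 4))  # the N silently turns into an A
```

The second line of output shows the cost: once packed, the read no longer records that the base was ambiguous.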
Tag Sequences
o tag ID tag TA tags
Detect and remove tag sequences
Data upload Tag sequence definition
Tag sequence prediction
Parameter definition Download results
Sequence Contamination
Principal component analysis (PCA) of dinucleotide relative abundance Microbial metagenomes Viral metagenomes
Identification and removal of sequence contamination
Contaminant identification
- Current methods have critical limitations
- Dinucleotide relative abundance uses the information content in sequences -> cannot identify single contaminant sequences
- Sequence similarity seems to be the only reliable option to identify single contaminant sequences
- BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, ...)
- Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)
DeconSeq web interface Two types of reference databases Remove Retain
DeconSeq web interface (cont.)
Human DNA contamination identified in 145 out of 202 metagenomes
Contamination Identification
http://edwards.sdsu.edu/genomepeek
16S
reca
rpob
groel
Using prinseq to check the quality of the data
Using prinseq
1. In the folder, go to /home/qiime/documents/ecoli
2. Right click on CIA49E_coli_DH10B_Control_200.zip and choose Extract Here
3. Right click on R_2014_02_04_00_54_34_user_CIA-49-Ion_PGM_E_coli_DH10B_Control_200 and choose Open Terminal Here
4. Run this command:
   prinseq-lite.pl -fastq Ecoli.fastq -graph_stats ld,gc,qd,ns,ts -graph_data ~/Desktop/DH10B.gd
5. Open Firefox, go to http://edwards.sdsu.edu/prinseq and choose Get Report
After checking the quality:
1. Trim the sequences to remove low quality (<15) at the right end
2. Filter the sequences to remove sequences shorter than 200 bp
3. Create a new fastq file
To list all options: prinseq-lite.pl -h
Trimming
prinseq-lite.pl -fastq ecoli.fastq -min_len 200 -trim_qual_right 15 -out_good DH10B.trimmed
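To make the two steps concrete (trim the right end, then drop short reads), here is a simplified Python sketch. It is not PRINSEQ's implementation: it removes bases from the right end one at a time while their score is below the cutoff, and PRINSEQ's own trimming rules may differ. The function names, the Phred + 33 encoding, and the example reads are assumptions for illustration.

```python
def trim_right(sequence, quality, min_quality=15, offset=33):
    """Remove bases from the right end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def trim_and_filter(records, min_quality=15, min_length=200):
    """Trim each (identifier, sequence, quality) record, then drop reads that end up too short."""
    for identifier, sequence, quality in records:
        trimmed_sequence, trimmed_quality = trim_right(sequence, quality, min_quality)
        if len(trimmed_sequence) >= min_length:
            yield identifier, trimmed_sequence, trimmed_quality


# Hypothetical short reads; min_length is lowered to 5 so the example stays readable.
reads = [
    ("r1", "ACGTACGT", "IIIII###"),  # '#' is Phred 2, so three bases are trimmed
    ("r2", "ACGT", "II##"),          # trimmed to 2 bp, then filtered out
]
kept = list(trim_and_filter(reads, min_quality=15, min_length=5))
print(kept)
```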
Convert fastq to fasta
On the virtual box:
- fastq2qual_fasta.pl
- fastq_to_fasta
- prinseq-lite.pl
On the web:
- http://edwards.sdsu.edu/cgi-bin/fastq2fasta.cgi
Using prinseq to convert fastq to fasta
prinseq-lite.pl -h
prinseq-lite.pl -fastq Ecoli_trimmed.fastq -out_format 2 -out_good Ecoli_trimmed
Sequence Assembly
Assembling the data
- Problem: the longest single sequence possible is 1,000 bp, and most technology is 50-500 bp
- Microbial genomes are around 2,000,000 bp or more
- Therefore, how do you sequence a whole genome?
Sequencing the genomes
- Extract DNA
- Shear DNA into small pieces
- Ligate adapters on each end
- Sequence using next generation sequencing
Sequence assembly
- Before we look at the data:
- Can we make longer pieces?
The assembly
- A hierarchical data structure that maps sequence data to a reconstruction of the target
- The assembly groups:
  - reads into contigs
  - contigs into scaffolds
- Contigs provide:
  - a multiple sequence alignment of reads
  - a consensus sequence
- Scaffolds provide:
  - contig order and orientation
  - sizes of the gaps between contigs
Sequence assembly Reads Contigs Scaffolds
Four approaches to assembly Naïve approach Greedy approach Overlap / Layout / Consensus de Bruijn Graphs
Naïve approach
- Compare every sequence to every other sequence
- Find stretches that are the same
- Need to account for Phred scores: what if a base is wrong?
- How long does a sequence need to be to be unique?
Sequence composition
- 4 bases
- $4^n$ possible sequences of length n, so a 1 in $4^n$ chance of finding a given sequence if all bases are evenly used (they are not)
- 3 bp: $4^3$ = 64
- 8 bp: $4^8$ = 65,536
- 20 bp: $4^{20}$ = 1,099,511,627,776
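The arithmetic can be checked directly, and it also gives a feel for how long a sequence must be before it is expected to be unique. The Python sketch below assumes uniformly random bases (which, as noted, real genomes are not) and uses the 2,000,000 bp genome size mentioned earlier; the function names are made up for this example.

```python
def possible_sequences(n):
    """Number of distinct DNA sequences of length n."""
    return 4 ** n


def expected_chance_matches(n, genome_length):
    """Expected number of positions where one specific n-mer occurs by chance,
    assuming every base is equally likely and independent."""
    positions = genome_length - n + 1
    return positions / possible_sequences(n)


for n in (3, 8, 20):
    print(n, possible_sequences(n), expected_chance_matches(n, 2_000_000))
```

Under these assumptions an 8-mer is expected about 30 times in a 2 Mbp genome purely by chance, while a specific 20-mer is expected far less than once.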
Problems with this approach
- Sequences are not random
- Most genomes contain biased information
- Repeat sequences in the genome
Greedy approaches
- Start with a sequence
- Keep extending it while another sequence matches the end
- When it cannot be extended further, mark it as a contig
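The greedy idea can be sketched in Python as follows. This toy version only extends the right end of the seed, requires exact matches, and ignores quality scores and reverse complements, so it illustrates the principle rather than any real assembler. The function names and the minimum overlap of 3 are assumptions for this example.

```python
def overlap(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`,
    or 0 if it is shorter than min_overlap."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_extend(seed, reads, min_overlap=3):
    """Extend `seed` to the right with the best-overlapping read until nothing matches.

    Returns the finished contig and the reads that were never used.
    """
    contig = seed
    remaining = list(reads)
    while True:
        best_size, best_index = 0, None
        for index, read in enumerate(remaining):
            size = overlap(contig, read, min_overlap)
            if size > best_size:
                best_size, best_index = size, index
        if best_index is None:
            return contig, remaining  # cannot be extended further: mark as a contig
        read = remaining.pop(best_index)
        contig += read[best_size:]


contig, unused = greedy_extend("AACCGGT", ["GTTACG", "CCGGTTA", "TTTTTT"])
print(contig, unused)
```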
Improving greedy approaches
- Only use high quality sequence
- Use reads that are represented more than n times in the sample (SSAKE)
- End-to-end overlap vs. partial overlap
- Ignore low coverage regions
- Also incorporate quality scores (SHARCGS)
- In general, greedy approaches are fast but not very good: they make lots of short contigs
Overlap / Layout / Consensus All versus all comparison (done with K-mers for speed). Generate approximate read layout as an overlap graph. Use multiple sequence alignments to resolve layout.
Newbler (O/L/C)
- Makes unitigs: single contigs with no discrepancies
- Merges unitigs into contigs
- May split unitigs and even reads (could be chimeras)
- Uses coverage to compensate for base calls
- Works in flow space to calculate homopolymeric tracts
- More accurate than average of averages
Assembly is a graph problem Overlap/Layout/Consensus de Bruijn Graph Greedy graphs A graph is nodes + edges node edge
Assemble these two sequences! AACCGGT CCGGTTA Consensus: AACCGGTTA
AACCGGT as graphs Node = K-mers; edges = nodes that overlap by K-1 bases. aacc accg ccgg cggt Here K = 4, but in reality K = 19 to 31
CCGGTTA as graphs ccgg cggt ggtt gtta
Join the two graphs ccgg cggt ggtt gtta aacc accg ccgg cggt
Join the two graphs ccgg cggt ggtt gtta aacc accg ccgg cggt AACCGGTTA
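The joined graph can be reproduced in a short Python sketch that follows the convention used on these slides: nodes are k-mers, and an edge links two k-mers that overlap by K-1 bases within a read. Many de Bruijn graph descriptions instead use (K-1)-mers as nodes and k-mers as edges, so treat this as one possible formulation. The walk only handles a single unbranched path, and the function names are assumptions for this example.

```python
def kmers(sequence, k):
    """All k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins consecutive k-mers of a read (overlap of k-1 bases)."""
    graph = {}
    for read in reads:
        nodes = kmers(read, k)
        for node in nodes:
            graph.setdefault(node, set())
        for current, following in zip(nodes, nodes[1:]):
            graph[current].add(following)
    return graph


def walk_unbranched(graph):
    """Spell the contig along the path from the single start node while each node
    has exactly one successor."""
    targets = {node for successors in graph.values() for node in successors}
    sources = [node for node in graph if node not in targets]
    if len(sources) != 1:
        raise ValueError("expected exactly one start node")
    node = sources[0]
    contig, visited = node, {node}
    while len(graph[node]) == 1:
        (node,) = graph[node]
        if node in visited:  # stop rather than loop forever on a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig


graph = build_graph(["AACCGGT", "CCGGTTA"], k=4)
print(sorted(graph))
print(walk_unbranched(graph))
```

Shared k-mers such as CCGG and CGGT become a single node each, which is what joins the two reads.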
[Figure] Differences between an overlap graph and a de Bruijn graph for assembly. Schatz MC et al. Genome Res. 2010;20:1165-1173. © 2010 by Cold Spring Harbor Laboratory Press
Problems with all assemblies
- Sequences are not random
- Most genomes contain biased information
- Repeat sequences in the genome
Repeats
Repeats have multiple sinks/sources
Repeats have multiple sinks/sources
- 16S: Salmonella has 7 rrn operons
- Salmonella recombines at rrn operons
Helm and Maloy
Repeat sequences
- What happens if the repeat is longer than the read length?
- Need paired end reads to resolve order
- Need pairs that span the repeat
- Need pairs with one end in the repeat
Mate pair and paired end sequencing
Mate pair Sequencing (Ion Torrent) Add linkers
Mate pair sequencing (Ion Torrent) Nick Sequencing migration
Paired End Sequencing (Illumina) Left read DNA fragment Right read
Paired End Sequencing (Illumina) Number of fragments Fragment length
Paired End Sequencing (Illumina)
Joining paired end sequences
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]
Citation - PEAR: a fast and accurate Illumina Paired-End read merger
Zhang et al (2014) Bioinformatics 30(5): 614-620 doi:10.1093/bioinformatics/btt593
License: Creative Commons Licence
Bug-reports and requests to: Tomas.Flouri@h-its.org and Jiajie.Zhang@h-its.org

Usage: pear <options>
Standard (mandatory):
  -f, --forward-fastq <str>   Forward paired-end FASTQ file.
  -r, --reverse-fastq <str>   Reverse paired-end FASTQ file.
  -o, --output <str>          Output filename.

http://sco.h-its.org/exelixis/web/software/pear/
Repeats A B C Paired end reads or mate pairs
Sequence assembly Reads Contigs Scaffolds
Current assemblers AMOS Celera WGA Assembler CLC Genomics Workbench DNA Dragon DNAnexus Euler Geneious IDBA (Iterative De Bruijn graph short read Assembler) LIGR Assembler (derived from TIGR Assembler) MIRA (Mimicking Intelligent Read Assembly) Newbler Phrap SSAKE SOAPdenovo SPAdes Velvet
Assembly 1. De novo assembly with newbler (runassembly) 2. Map to the genome with newbler (runmapping) 3. De novo assembly with spades (spades.py)
De novo assembly with newbler
1. Convert fastq to fasta
2. Use this command: runassembly <fasta file>
   e.g. (all on one line)
   runassembly -noinfo -nobig -noace Ecoli.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna
basic_stats: assembly
------------- BASIC FASTA STATISTICS ---------------
Total number of sequences:   1882
Total number of bases:       4,266,422 bp (4.27 Mb)
Average sequence length:     2266.96
Minimum sequence length:     100 bp
Maximum sequence length:     23892 bp
N50:                         3491 bp
50% of total sequence length is contained in 369 sequences
GC %:                        50.79 %
-----------------------------------------------------
basic_stats: mapping
------------- BASIC FASTA STATISTICS ---------------
Total number of sequences:   654
Total number of bases:       4,339,931 bp (4.34 Mb)
Average sequence length:     6635.98
Minimum sequence length:     160 bp
Maximum sequence length:     49317 bp
N50:                         12274 bp
50% of total sequence length is contained in 109 sequences
GC %:                        50.76 %
N50
N50 Length = N50
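N50 can be computed by sorting the contigs from longest to shortest and adding lengths until half of the total is reached; the length of the contig that crosses the halfway point is the N50. The Python sketch below uses that common definition (an assumption about what basic_stats reports) and also returns how many contigs were needed, matching the "50% of total sequence length is contained in ... sequences" line. The contig lengths are hypothetical.

```python
def n50(lengths):
    """Return (N50, number of contigs needed to reach half of the total length)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequences given")


# Hypothetical contig lengths: total 290, so half is 145; 80 + 70 = 150 crosses it.
print(n50([20, 80, 40, 70, 30, 50]))
```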
Mapping to a genome
1. Convert the E. coli GenBank file to fasta format
   GB2Fasta.pl CP000948_DH10B.gbk CP000948_DH10B.fasta
2. Map the reads to the genome
   runmapping -cpu 2 -noinfo -noace -nobig -gref CP000948_DH10B.fasta DH10B.trimmed.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna
SPAdes 3.7.1
1. Run this command (all on one line):
   /opt/spades-3.7.1-linux/bin/spades.py --iontorrent -k 21,33,55,77,99,127 --mismatch-correction -t 2 -s ecoli.fastq -o ecoli.spades
Hybrid assembly Geni Silva
scaffold_builder http://edwards.sdsu.edu/scaffold_builder Silva et al. Source Code for Biology and Medicine 2013, 8:23
How do we know if the assembly is good?
QUAST
1. Run this command (all on one line):
   quast -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
   or, calling the script directly:
   python /opt/quast_3.2/quast.py -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
2. When the command finishes:
   gnome-open Ecoli.quast/report.html
   (also open Ecoli.quast/alignment.svg and Ecoli.quast/report.pdf)
Mauve
- Locally co-linear blocks (LCBs)
- Homologous regions shared by any two genomes
- User-specified cutoff for what counts as homologous
Multi-MUMs
- Multiple unique matches
- Identify k-mers that:
  - Are shared by 2 or more genomes
  - Only appear once per genome
  - Are bounded by mismatched bases (cannot be extended further)
Mauve
1. Find all local multi-MUMs
2. Calculate phylogenetic guide tree
3. Select a subset of multi-MUMs to create LCBs
4. Anchor regions not covered with multi-MUMs
5. Align LCBs
How to construct...
- Construct sorted list of k-mers
- Find the set that matches the occurrence criteria:
  - Once per genome
  - >1 genome
- Extend until mismatch
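The occurrence criteria can be sketched in Python as below. This is a simplification and not Mauve's implementation: it requires a k-mer to be present in every genome (the slides allow two or more), ignores reverse complements, uses a dictionary instead of a sorted list, and skips the extension step. The genomes and function names are hypothetical.

```python
from collections import Counter


def unique_kmer_positions(genome, k):
    """Map each k-mer that occurs exactly once in the genome to its start position."""
    starts = range(len(genome) - k + 1)
    counts = Counter(genome[i:i + k] for i in starts)
    return {genome[i:i + k]: i for i in starts if counts[genome[i:i + k]] == 1}


def shared_unique_kmers(genomes, k):
    """K-mers that occur exactly once in every genome, with their start in each genome."""
    per_genome = [unique_kmer_positions(genome, k) for genome in genomes]
    shared = set(per_genome[0]).intersection(*per_genome[1:])
    return {kmer: [positions[kmer] for positions in per_genome] for kmer in shared}


# Hypothetical toy genomes.
seeds = shared_unique_kmers(["ACGTTGCA", "TTACGTTG"], k=4)
for kmer in sorted(seeds):
    print(kmer, seeds[kmer])
```

Neighbouring seeds such as ACGT, CGTT and GTTG would merge into one longer match once they are extended until a mismatch.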
Constructing sorted list
- Hash using hash function:
- $S_1$ = start of multi-MUM M in the first genome
- $S_j$ = start of multi-MUM M in genome j
Join all MUMs
- Join all MUMs together where: $M_i.S_j \le M_{i+1}.S_j$
- For any genome j, the start of multi-MUM i is less than or equal to the start of the next multi-MUM
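The ordering condition can be illustrated with a small greedy filter in Python. Each multi-MUM is represented as a tuple of start positions, one per genome; after sorting by the start in the first genome, a multi-MUM is kept only if its start is not smaller than the previous kept one in every genome. This is an illustrative simplification, not Mauve's LCB selection, and the data and function name are hypothetical.

```python
def collinear_subset(mums):
    """Keep multi-MUMs whose starts are non-decreasing in every genome.

    `mums` is a list of tuples; element j of a tuple is the start in genome j.
    """
    kept = []
    for starts in sorted(mums):  # sorted by start in the first genome
        if not kept or all(previous <= current
                           for previous, current in zip(kept[-1], starts)):
            kept.append(starts)
    return kept


# Hypothetical starts in two genomes; (10, 1) breaks the order in the second genome.
chain = collinear_subset([(10, 1), (0, 2), (2, 4), (1, 3)])
print(chain)
```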
Start with all matches
Partition into locally co-linear blocks
Remove low scoring LCBs (3 * length of k-mer)
Reduce k-mer size, and repeat for unaligned regions
Mauve Speed Aligned many prokaryote genomes Human, mouse, and rat genomes
Problems - Repeats
- For a repetitive element appearing r times in G genomes:
  - $r^G$ possible combinations
  - only r of them are correct!
Mauve
- From the Apps menu on the top left, choose Mauve
- File > Align with progressiveMauve