Sequence File Formats

Delilah Marsh

1 File Formats

2 Sequence File Formats
Different formats for different uses
Competing formats developed in parallel
Some are easy to read, some are easy to write programs for
You don't have to stick to these formats, but the parsers are already written!
All formats are plain text

3 Standard genetic code
Symbol  Meaning             Origin
G       G                   Guanine
A       A                   Adenine
C       C                   Cytosine
T       T                   Thymine
R       G or A              purine
Y       T or C              pyrimidine
M       A or C              amino
K       G or T              keto
N       G or A or T or C    any

4 Standard protein codes
One  Three  Amino acid       One  Three  Amino acid
A    Ala    Alanine          M    Met    Methionine
C    Cys    Cysteine         N    Asn    Asparagine
D    Asp    Aspartic acid    P    Pro    Proline
E    Glu    Glutamate        R    Arg    Arginine
F    Phe    Phenylalanine    S    Ser    Serine
G    Gly    Glycine          T    Thr    Threonine
H    His    Histidine        V    Val    Valine
I    Ile    Isoleucine       W    Trp    Tryptophan
K    Lys    Lysine           Y    Tyr    Tyrosine
L    Leu    Leucine          X    Xaa    Unknown

5 GenBank
More complex; includes detailed information on genes, CDS, annotation, etc.
Human readable
Difficult to parse
Use standard parsers (BioPerl, BioJava, etc.)

6 LOCUS NC_ bp ss-dna circular PHG 17-APR-2009
DEFINITION Pseudomonas phage Pf3, complete genome.
ACCESSION NC_
VERSION NC_ GI:
DBLINK Project:14061
KEYWORDS.
SOURCE Pseudomonas phage Pf3
ORGANISM Pseudomonas phage Pf3
  Viruses; ssdna viruses; Inoviridae; Inovirus.
FEATURES Location/Qualifiers
source
  /organism="pseudomonas phage Pf3"
  /mol_type="genomic DNA"
  /host="pseudomonas aeruginosa"
  /db_xref="taxon:10872"
  /note="pf3 bacteriophage DNA from P.aeruginosa infected with plasmid RP1."
gene join( ,1..106)
  /locus_tag="pf3_1"
  /db_xref="geneid: "
CDS join( ,1..106)
  /locus_tag="pf3_1"
  /note="orf 58, part 2"
  /codon_start=1
  /transl_table=11
  /product="hypothetical protein"
  /protein_id="np_ "
  /db_xref="gi: "
  /db_xref="geneid: "
  /translation="msyyvcvqlvndvchewaersdllslpegsglqiggmllllsat
  AWGIQQIARLLLNR"

7
3241 aggtcctgtt ggccttaaga tcacccaagg gcatcttgcc agatggtacc gtcattactt
3301 atgagaaaat atcctcaatg ggtaatggct ataccttcga gcttgagtcg cttatatttg
3361 cggctcttgc tcggtcttta tgcgaattac tgggcttacg accgtcagat gttacggtct
3421 atggcgatga cataatattg ccatcagacg cgtgcagtcc tctagttgaa gttttctcct
3481 atgttggttt tcgtaccaac aagaagaaaa cgttttctag tggaccgttc cgagagtcgt
3541 gcggaaagca ctactttttg ggcgttgacg tcacaccttt ctacatacgt cgccgtatag
3601 tgagtccctc cgatctcata ctggttttga accagatgta tcgttgggcc acaattgacg
3661 gcgtatggga tcctagggta tatcctgtat acaccaagta tagacgttac cttccggaaa
3721 ttctccggag gaatgtcgtg cctgatggat acggtgatgg tgccctcgtc ggatctgtct
3781 taatcagtcc tttcgcagaa aatcgcggtt gggttcggcg tgtgccgatg attatagaca
3841 agaggaaaga ccgagttcgt gacgaatatg gttcgtatct ctacgagcta tggtcgttgc
3901 agcaactcga atgtgacagt gagttcccct ttaacgggtc gctggtcgtt ggttccactg
3961 atggcactct cgcttacgca caccgagaac ggttacctac cgttatcagt gatgccgtaa
4021 gtgcgtttga catcatgtgg ataccgtgca gtagtcgtgt cctggctccc tacggggatt
4081 tccggaggca cgaaggctct atcctaaaaa tggggtagcg cctgggaggg gtgcattatg
4141 caccctaggt tagcaatact taaactaacc ttctcaaaag agagagtgaa ggctctgctt
4201 tgccctcact cctccca
//
LOCUS NC_ bp ds-rna linear PHG 23-AUG-2008
DEFINITION Pseudomonas phage phi8 segment S, complete sequence.
ACCESSION NC_
VERSION NC_ GI:
DBLINK Project:14731
KEYWORDS.
SOURCE Pseudomonas phage phi8
ORGANISM Pseudomonas phage phi8
  Viruses; dsrna viruses; Cystoviridae; Cystovirus.

8 GFF3
Tab separated format
Easy to parse
Attributes are tag/value pairs separated by ;
Columns:
1. Contig
2. Source database
3. Feature type
4. Start
5. Stop
6. Score
7. Strand
8. Phase
9. Attributes
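Because GFF3 is tab separated with nine fixed columns, a single feature line can be split with a few lines of Python. The sketch below is only an illustration: the column names in GFF3_COLUMNS, the function name parse_gff3_line and the example feature line are assumptions made up for this example, and it assumes attributes are written as tag=value pairs.

```python
from urllib.parse import unquote

# Names for the nine columns listed on the slide (the names are this sketch's choice).
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; return None for comments or blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are tag=value pairs separated by ';'
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attributes[tag.strip()] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical feature line, invented for illustration.
example = "contig1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=hypothetical"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

For real annotation files an existing parser is still the safer choice, since it also handles directives and embedded sequence sections.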

9 ASN.1 Developed as computer readable form of GenBank Not widely used

10 ASN.1 seq { id { local id 1 }, descr { title "" }, inst { repr raw, mol aa, length 131, topology linear, { seq-data iupacaa "TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQA TGGWKTCSGTCTTSTSTRHRGRSGWSARTTTAACLRASRKSMRAACSRSAGSRPNRFA PTLMSSCITSTTGPPAWAGDRSHE" } }, seq { id { local id 1 }, descr { title "" }, inst { repr raw, mol aa, length 131, topology linear, { seq-data iupacaa "TSPASIRPPAGPSSR RPSPPGPRRPTGRPCCSAAPRRPQAT GGWKTCSGTCTTSTSTRHRGRSGW RASRKSMRAACSRSAGSRPNRFAPTL MSSCITSTTGPPAWAGDRSHE" } }

11 Fasta
Also called Pearson format
Simplest file format. Easy to parse, easy to use
>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
atccgatagcatgacttactacgctagcatcagtcat
CAT
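To show why fasta is considered easy to parse, here is a minimal reader sketch. The function names read_fasta and _make_record are made up for this example; it assumes every record starts with a ">" line and that the sequence may wrap over several lines.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The identifier is everything up to the first space; the rest is optional information.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


example = """>identifier [optional information]
ATGACTAGCATGCATCGATCGATCGACTAGCATG
ACTGCACTACGACGACAGCAAC
>identifier2 [optional information]
ACTAGCTCAGCTAGAGAGCTACGATCAGCACTAC
""".splitlines()

records = list(read_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```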

12 Qual >identifier [optional information] >identifier2 [optional information]

13 fastq
Based on fasta format
Contains information about the quality of the sequence
Quality comes from sequencing machines!
Four lines per sequence:
Line starting with @ = identifier line before the sequence
DNA sequence
Line starting with + = identifier line before the quality scores
String = quality scores, each encoded as the ASCII character with code (score + 33)
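The four-lines-per-sequence layout can be read with a short sketch like the one below. It assumes unwrapped records (exactly four lines each, identifier line starting with @); the function name read_fastq and the read in the example are hypothetical.

```python
def read_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], sequence, quality


# Hypothetical record, invented for illustration.
example = ["@read1 length=8", "ACGTACGT", "+read1 length=8", "IIIII5+!"]
fastq_records = list(read_fastq(example))
print(fastq_records)
```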

14 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA
AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA
CCTTCCAAAGCAATGCCAATA
+SRR EIXKN4201CFU84 length=93
A1EA1EA5

15 Base calling
Need to be sure which base you have identified
Depends on the technology
Each machine includes its own software
Phred is a historical package developed at U. Washington
Phred scores encode the probability that a base call is wrong: the higher the score, the more likely the base is correct

16 Quality values
Phred 10: 1 x 10^-1 chance that the base is wrong
Phred 20: 1 x 10^-2 chance that the base is wrong
Phred 30: 1 x 10^-3 chance that the base is wrong
Phred 40: 1 x 10^-4 chance that the base is wrong
Phred 99: the base is correct!
Fastq scores are the Phred score + 33, then converted to the ASCII character with that code
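The relationship between a Phred score, its error probability and the fastq character can be written out as a small sketch. The function names are made up for this example, and the offset of 33 is the one stated on the slide.

```python
def phred_to_error_probability(q):
    """Probability that the base call is wrong for Phred score q."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Turn a fastq quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def encode_quality(scores, offset=33):
    """Turn a list of Phred scores into a fastq quality string."""
    return "".join(chr(q + offset) for q in scores)


scores = decode_quality("I5+!")
for q in scores:
    print(q, phred_to_error_probability(q))
print(encode_quality(scores))
```

Here "I" has ASCII code 73, so it decodes to Phred 40, and "!" (code 33) decodes to Phred 0.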

17 ASCII character codes
ASCII Char   ASCII Char   ASCII Char   ASCII Char   ASCII Char
33    !      50    2      70    F      90    Z      110   n
34    "      51    3      71    G      91    [      111   o
35    #      52    4      72    H      92    \      112   p
36    $      53    5      73    I      93    ]      113   q
37    %      54    6      74    J      94    ^      114   r
38    &      55    7      75    K      95    _      115   s
39    '      56    8      76    L      96    `      116   t
40    (      57    9      77    M      97    a      117   u
41    )      58    :      78    N      98    b      118   v
42    *      59    ;      79    O      99    c      119   w
43    +      60    <      80    P      100   d      120   x
44    ,      61    =      81    Q      101   e      121   y
45    -      62    >      82    R      102   f      122   z
46    .      63    ?      83    S      103   g      123   {
47    /      64    @      84    T      104   h      124   |
48    0      65    A      85    U      105   i      125   }
49    1      66    B      86    V      106   j      126   ~

18 EIXKN4201CFU84 length=93
DNA sequence:
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGA
AAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAA
CCTTCCAAAGCAATGCCAATA
+SRR EIXKN4201CFU84 length=93
Quality scores:
A1EA1EA5
Note: Illumina used to have a fastq format that was not compatible with everyone else's format!

19 Ion quality scores PRINSEQ Rob Schmieder

20 prinseq

21 Basic data analysis New dataset Assemble data Perform similarity search

22 Bad data analysis

28 New dataset Good data analysis

29 Good data analysis New dataset Quality control & Preprocessing

30 Good data analysis New dataset Quality control & Preprocessing Assembly Similarity search

32 3 Tools for metagenomic data

33 Quality control and data preprocessing

34 Number and Length of Sequences

35 Number/Length of sequences
[Figure panels: Bad, Good]
Reads should be approximately the same length (same number of cycles) → short reads are likely lower quality

36 Quality of Sequences

37 Linearly degrading quality across the read → trim low quality ends

38 Quality filtering
Any region with a homopolymer will tend to have a lower quality score
Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)

39 Low quality sequence issue Most assemblers or aligners do not take into account quality scores Errors in reads complicate assembly, might cause misassembly, or make assembly impossible

40 What if quality scores are not available?
Alternative: infer quality from the percent of Ns found in the sequence
Remove regions with a high number of Ns
Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)

41 Ambiguous bases
If you can afford the loss, filter out all reads containing Ns
Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, ...) use a 2-bit encoding system for nucleotides
Some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
2-bit example: 00 - A, 01 - C, 10 - G, 11 - T
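A small sketch makes clear why an N cannot survive 2-bit encoding: four codes are all used by A, C, G and T, so an N has to be turned into one of them. The code below uses the 00/01/10/11 mapping from the slide and the fixed replacement A; the function names pack and unpack are made up, and this is not how any particular assembler stores its data.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def pack(sequence, n_replacement="A"):
    """Pack a DNA string into an integer using 2 bits per base.

    Any symbol that is not A, C, G or T (for example N) is replaced
    by n_replacement, because 2 bits leave no room for a fifth symbol.
    """
    value = 0
    for base in sequence.upper():
        if base not in CODES:
            base = n_replacement
        value = (value << 2) | CODES[base]
    return value, len(sequence)


def unpack(value, length):
    """Recover the DNA string from the packed integer."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BASES[(value >> shift) & 0b11])
    return "".join(bases)


packed, length = pack("ANGT")
print(bin(packed), unpack(packed, length))
```

Unpacking "ANGT" gives "AAGT": the N is silently lost, which is one reason to filter reads containing Ns beforehand.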

42 Tag Sequences

43 o tag ID tag TA tags

44 Detect and remove tag sequences

45 Data upload Tag sequence definition

46 Tag sequence prediction

47 Parameter definition Download results

48 Sequence Contamination

49 Principal component analysis (PCA) of dinucleotide relative abundance Microbial metagenomes Viral metagenomes

50 Identification and removal of sequence contamination

51 Contaminant identification
Current methods have critical limitations
Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences
Sequence similarity seems to be the only reliable option to identify single contaminant sequences
BLAST against the human reference genome is slow, and the reference lacks corresponding regions (gaps, variants, ...)
Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)

52 DeconSeq web interface Two types of reference databases Remove Retain

53 DeconSeq web interface (cont.)

54 Human DNA contamination identified in 145 out of 202 metagenomes

55 Contamination Identification

57 16S

58 recA

59 rpoB

60 groEL

61 Using prinseq to check the quality of the data

62 Using prinseq
1. In the folder, go to /home/qiime/documents/ecoli
2. Right click on CIA49E_coli_DH10B_Control_200.zip
   Choose Extract Here
3. Right click on R_2014_02_04_00_54_34_user_CIA-49-Ion_PGM_E_coli_DH10B_Control_200
   Choose open terminal here
4. Run this command:
   prinseq-lite.pl -fastq Ecoli.fastq -graph_stats ld,gc,qd,ns,ts -graph_data ~/Desktop/DH10B.gd
5. Open firefox: go to Choose Get Report

63 After checking the quality:
1. Trim the sequences to remove low quality (<15) at the right end
2. Filter the sequences to remove sequences shorter than 200 bp
3. Create a new fastq file
prinseq-lite.pl -h

64 Trimming
prinseq-lite.pl -fastq ecoli.fastq -min_len 200 -trim_qual_right 15 -out_good DH10B.trimmed
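To make the two steps concrete, the sketch below trims bases with a quality score under 15 from the right end of a read and then checks the minimum length. It is a simplified illustration of the idea, not PRINSEQ's implementation; the function names and the example read are made up, and it assumes the length filter is applied to the trimmed read.

```python
def trim_right_by_quality(sequence, quality, threshold=15, offset=33):
    """Remove bases from the right end while their Phred score is below threshold."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]


def passes_min_length(sequence, min_len=200):
    """True if the (trimmed) read is at least min_len bases long."""
    return len(sequence) >= min_len


# Hypothetical read: scores are 40,40,40,40,40,20,10,0 from left to right.
trimmed_seq, trimmed_qual = trim_right_by_quality("ACGTACGT", "IIIII5+!")
print(trimmed_seq, trimmed_qual, passes_min_length(trimmed_seq, min_len=5))
```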

65 Convert fastq to fasta
On the virtual box:
fastq2qual_fasta.pl
fastq_to_fasta
prinseq-lite.pl
On the web:

66 Using prinseq to convert fastq to fasta
prinseq-lite.pl -h
prinseq-lite.pl -fastq Ecoli_trimmed.fastq -out_format 2 -out_good Ecoli_trimmed

67 Sequence Assembly

68 Assembling the data
Problem: the longest single sequence possible is 1,000 bp, and most technology is bp.
Microbial genomes are 2,000,000 bp
Therefore, how do you sequence a whole genome?

69 Sequencing the genomes
Extract DNA
Shear DNA into small pieces
Ligate adapters on each end
Sequence using next generation sequencing

70 Sequence assembly
Before we look at the data
Can we make longer pieces?

71 The assembly
A hierarchical data structure that maps sequence data to a reconstruction of the target.
The assembly groups
  reads into contigs
  contigs into scaffolds
Contigs provide
  multiple sequence alignment of reads
  consensus sequence
Scaffolds provide
  contig order and orientation
  sizes of the gaps between contigs

72 Sequence assembly Reads Contigs Scaffolds

73 Four approaches to assembly Naïve approach Greedy approach Overlap / Layout / Consensus de Bruijn Graphs

74 Naïve approach
Compare every sequence to every other sequence
Find stretches that are the same
Need to account for phred scores: what if a base is wrong?
How long does a sequence need to be to be unique?

75 Sequence composition
4 bases
4^n possible sequences of length n, so a 1 in 4^n chance of finding a given sequence if all bases are evenly used (they are not)
3 bp: 4^3 = 64
8 bp: 4^8 = 65,536
20 bp: 4^20 = 1,099,511,627,776
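The arithmetic can be checked, and extended to the question of how long a sequence must be to be unique, with a short sketch. It assumes random, evenly used bases (which the slide notes is not true of real genomes) and uses the 2,000,000 bp microbial genome size mentioned earlier; the function names are made up.

```python
def possible_sequences(n):
    """Number of distinct DNA sequences of length n."""
    return 4 ** n


def expected_random_occurrences(n, genome_length):
    """Expected number of times one given n-mer occurs in a random genome."""
    return (genome_length - n + 1) / possible_sequences(n)


for n in (3, 8, 20):
    print(n, possible_sequences(n), expected_random_occurrences(n, 2_000_000))
```

Under these assumptions an 8 bp sequence is expected many times in a 2,000,000 bp genome, while a 20 bp sequence is expected far less than once, so it would usually be unique.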

76 Problems with this approach
Sequences are not random
Most genomes contain biased information
Repeat sequences in the genome

77 Greedy approaches
Start with a sequence
Keep extending it while another sequence matches the end
When it cannot be extended further, mark it as a contig
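The three steps can be sketched as follows. This is a toy illustration of greedy extension to the right only, with exact matching and no quality scores; the function names and the min_overlap parameter are assumptions for this example and do not describe any particular assembler.

```python
def overlap(left, right, min_overlap):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_extend(seed, reads, min_overlap=3):
    """Extend seed to the right while some unused read overlaps its end."""
    contig = seed
    unused = list(reads)
    while True:
        best_read, best_size = None, 0
        for read in unused:
            size = overlap(contig, read, min_overlap)
            # Only accept reads that actually add new bases.
            if size > best_size and size < len(read):
                best_read, best_size = read, size
        if best_read is None:
            return contig  # cannot be extended further: this is the contig
        contig += best_read[best_size:]
        unused.remove(best_read)


contig = greedy_extend("AACCGGT", ["CCGGTTA", "GTTACGA"])
print(contig)
```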

78 Improve greedy approaches
Only use high quality sequence
Use reads that are represented more than n times in the sample (SSAKE)
End to end overlap vs. partial overlap
Ignores low coverage regions
Also incorporate quality scores (SHARCGS)
In general, greedy approaches are fast but not very good: they make lots of short contigs

79 Overlap / Layout / Consensus All versus all comparison (done with K-mers for speed). Generate approximate read layout as an overlap graph. Use multiple sequence alignments to resolve layout.

80 Newbler (O/L/C)
Makes unitigs: single contigs with no discrepancies
Merges unitigs into contigs
May split unitigs and even reads (could be chimeras)
Uses coverage to compensate for base calls
Works in flow space to calculate homopolymeric tracts; more accurate than average of averages

81 Assembly is a graph problem Overlap/Layout/Consensus de Bruijn Graph Greedy graphs A graph is nodes + edges node edge

82 Assemble these two sequences! AACCGGT CCGGTTA Consensus: AACCGGTTA

83 AACCGGT as graphs Node = K-mers; edges = nodes that overlap by K-1 bases. aacc accg ccgg cggt Here K = 4, but in reality K = 19 to 31

84 CCGGTTA as graphs ccgg cggt ggtt gtta

85 Join the two graphs ccgg cggt ggtt gtta aacc accg ccgg cggt

87 Join the two graphs ccgg cggt ggtt gtta aacc accg ccgg cggt AACCGGTTA
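The worked example on these slides can be reproduced with a small sketch. It follows the slide's formulation (nodes are K-mers, edges join K-mers that overlap by K-1 bases), uses K = 4, and only handles the simple case where the joined graph is one unbranched path; the function names are made up for this example.

```python
def kmers(sequence, k):
    """All k-mers of sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins consecutive k-mers (overlap of k-1 bases)."""
    edges = {}
    for read in reads:
        nodes = kmers(read.lower(), k)
        for node in nodes:
            edges.setdefault(node, set())
        for left, right in zip(nodes, nodes[1:]):
            edges[left].add(right)
    return edges


def walk_unbranched_path(edges):
    """Spell the sequence along a graph that is a single simple path."""
    targets = {t for ts in edges.values() for t in ts}
    starts = [n for n in edges if n not in targets]
    if len(starts) != 1 or any(len(ts) > 1 for ts in edges.values()):
        raise ValueError("graph is not a single unbranched path")
    node = starts[0]
    sequence, seen = node, {node}
    while edges[node]:
        node = next(iter(edges[node]))
        if node in seen:
            raise ValueError("graph contains a cycle")
        seen.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


graph = build_graph(["AACCGGT", "CCGGTTA"], k=4)
consensus = walk_unbranched_path(graph).upper()
print(sorted(graph), consensus)
```

The shared nodes ccgg and cggt are stored once, which is what joins the two reads into one path.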

88 Differences between an overlap graph and a de Bruijn graph for assembly. Schatz M C et al. Genome Res. 2010;20: by Cold Spring Harbor Laboratory Press

89 Problems with all assemblies
Sequences are not random
Most genomes contain biased information
Repeat sequences in the genome

90 Repeats

94 Repeats have multiple sinks/sources

95 Repeats have multiple sinks/sources
16S
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy

96 Repeat sequences
What happens if the repeat is longer than the read length?
Need paired end reads to resolve order
Need pairs that span the repeat
Need pairs with one end in the repeat

97 Mate pair and paired end sequencing

98 Mate pair Sequencing (Ion Torrent) Add linkers

99 Mate pair sequencing (Ion Torrent) Nick Sequencing migration

100 Paired End Sequencing (Illumina) Left read DNA fragment Right read

101 Paired End Sequencing (Illumina) Number of fragments Fragment length

102 Paired End Sequencing (Illumina)

103 Joining paired end sequences
PEAR v0.9.6 [January 15, 2015] - [+bzlib +zlib]
Citation - PEAR: a fast and accurate Illumina Paired-End read merger
Zhang et al (2014) Bioinformatics 30(5): doi: /bioinformatics/btt593
License: Creative Commons Licence
Bug-reports and requests to: Tomas.Flouri@h-its.org and Jiajie.Zhang@h-its.org
Usage: pear <options>
Standard (mandatory):
-f, --forward-fastq <str> Forward paired-end FASTQ file.
-r, --reverse-fastq <str> Reverse paired-end FASTQ file.
-o, --output <str> Output filename.

104 Repeats A B C Paired end reads or mate pairs

105 Sequence assembly Reads Contigs Scaffolds

106 Current assemblers
AMOS
Celera WGA Assembler
CLC Genomics Workbench
DNA Dragon
DNAnexus
Euler
Geneious
IDBA (Iterative De Bruijn graph short read Assembler)
LIGR Assembler (derived from TIGR Assembler)
MIRA (Mimicking Intelligent Read Assembly)
Newbler
Phrap
SSAKE
SOAPdenovo
SPAdes
Velvet

107 Assembly
1. De novo assembly with newbler (runassembly)
2. Map to the genome with newbler (runmapping)
3. De novo assembly with spades (spades.py)

108 De novo assembly with newbler
1. Convert fastq to fasta
2. Use this command: runassembly <fasta file>
   e.g. (all on one line)
   runassembly -noinfo -nobig -noace Ecoli.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna

109 basic_stats: assembly BASIC FASTA STATISTICS Total number of sequences: 1882 Total number of bases: 4,266,422 bp (4.27 Mb) Average sequence length: Minimum sequence length: Maximum sequence length: N50: 3491 bp 100 bp bp 50% of total sequence length is contained in 369 sequences GC %: %

110 basic_stats: mapping BASIC FASTA STATISTICS Total number of sequences: 654 Total number of bases: 4,339,931 bp (4.34 Mb) Average sequence length: Minimum sequence length: Maximum sequence length: N50: bp 160 bp bp 50% of total sequence length is contained in 109 sequences GC %: %

111 N50

112 N50 Length = N50
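N50 is the length of the sequence at which, going from longest to shortest, the running total first reaches 50% of all bases; this matches the "50% of total sequence length is contained in ... sequences" line in the basic_stats output. A minimal sketch, with a made-up function name and made-up contig lengths:

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach 50% of the bases)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")


# Hypothetical contig lengths: total 30, so half is 15.
n50_length, n50_count = n50([2, 4, 6, 8, 10])
print(n50_length, n50_count)
```

With these lengths, 10 + 8 = 18 is the first running total to reach 15, so N50 is 8 and two sequences hold half of the bases.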

113 Mapping to a genome
1. Convert the E. coli genbank file to fasta format
   GB2Fasta.pl CP000948_DH10B.gbk CP000948_DH10B.fasta
2. Map the reads to the genome
   runmapping -cpu 2 -noinfo -noace -nobig -gref CP000948_DH10B.fasta DH10B.trimmed.fasta
3. How good is the assembly?
   basic_stats 454AllContigs.fna

114 SPAdes
Run this command (all on one line):
/opt/spades linux/bin/spades.py --iontorrent -k 21,33,55,77,99,127 --mismatch-correction -t 2 -s ecoli.fastq -o ecoli.spades

115 Hybrid assembly Geni Silva

116 scaffold_builder Silva et al. Source Code for Biology and Medicine 2013, 8:23

117 How do we know if the assembly is good?

118 QUAST
1. Run this command (all on one line):
   quast -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
   python /opt/quast_3.2/quast.py -o Ecoli.quast -R ecoli.fasta runmapping/454AllContigs.fna runassembly/454AllContigs.fna
2. When the command finishes:
   gnome-open Ecoli.quast/report.html
   (also open Ecoli.quast/alignment.svg and Ecoli.quast/report.pdf)

119 Mauve
Locally co-linear blocks
Homologous regions shared by any two genomes
User specified cutoff to homologous

120 Multi-mums
Multiple unique matches
Identify k-mers that:
Are shared by 2 or more genomes
Only appear once per genome
Are bounded by mismatched bases (cannot be extended further)

121 Mauve
Find all local multimums
Calculate phylogenetic guide tree
Select a subset of multimums to create LCBs
Anchor regions not covered with multimums
Align LCBs

122 How to construct...
Construct sorted list of k-mers
Find the set that matches the occurrence criteria:
Once per genome
>1 genome
Extend until mismatch
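The occurrence criteria can be sketched in a few lines. This illustration only finds k-mers that occur at most once in every genome and are present in at least two genomes; it leaves out the "extend until mismatch" step and uses a Counter instead of a sorted list, so it is not Mauve's implementation. The function name and the example genomes are made up.

```python
from collections import Counter


def shared_unique_kmers(genomes, k, min_genomes=2):
    """Map each qualifying k-mer to {genome index: start position}.

    A k-mer qualifies if it never occurs more than once in any genome
    and occurs (exactly once) in at least min_genomes genomes.
    """
    counts = [Counter(g[i:i + k] for i in range(len(g) - k + 1)) for g in genomes]
    seeds = {}
    for kmer in sorted(set().union(*counts)):
        if any(c[kmer] > 1 for c in counts):
            continue  # repeated somewhere: not unique
        hits = {j: genomes[j].find(kmer) for j, c in enumerate(counts) if c[kmer] == 1}
        if len(hits) >= min_genomes:
            seeds[kmer] = hits
    return seeds


# Hypothetical toy genomes, invented for illustration.
seeds = shared_unique_kmers(["AACCGGTT", "TTCCGGAA"], k=4)
print(seeds)
```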

123 Constructing sorted list Hash using hash function: S 1 = start of multimum M in first genome S j = start of multimum M in genome j

124 Join all mums
Join all mums together where: M_i.S_j <= M_(i+1).S_j
For any genome j, the start of multimum i is less than or equal to the start of the next multimum.

125 Start with all matches

126 Partition into locally co-linear blocks

127 Remove low scoring LCBs (3 * length of k-mer)

128 Reduce k-mer size, and repeat for unaligned regions

129 Mauve Speed Aligned many prokaryote genomes Human, mouse, and rat genomes

130 Problems - Repeats
For a repetitive element appearing r times in G genomes:
r^G possible combinations
r of them are correct!

131 Mauve
From the Apps menu on the top left, choose Mauve
File → Align with ProgressiveMauve
