A framework for a general-purpose sequence compression pipeline: a centroid-based compression

Liqing Zhang 1, Daniel Nasko 2, Martin Tříska 3, Harold Garner 4

1. Department of Computer Science, Programs in Genetics, Biochemistry, and Computational Biology, Virginia Tech
2. Center for Bioinformatics and Computational Biology, University of Delaware
3. Glamorgan Computational Biology Research Group, University of Glamorgan, United Kingdom
4. Virginia Bioinformatics Institute, Virginia Tech

Abstract

DNA sequence data accumulate at an overwhelming speed, outpacing the growth of disk storage and creating enormous challenges for data storage, processing, and analysis. Taking advantage of the fact that two human genomes differ by less than 0.1%, we and other groups previously proposed reference-based compression algorithms for genomic data. However, reference-based sequence compression only works when a reference genome exists, and many large-scale sequencing projects, such as metagenomic studies, do not have reference genomes readily available. We therefore need a compression method that can be applied in these cases. This project addresses the problem by introducing a centroid-based compression algorithm. The algorithm takes in large-scale next-generation sequencing data and clusters similar sequences into groups. Within each group a centroid sequence is identified, and the differences that each sequence has from its respective centroid sequence are encoded. Results show that the method is advantageous when the dataset contains many redundant sequences, as is typical of high-coverage next-generation sequencing data and metagenomic data. The framework developed here is for a general-purpose compression pipeline that can, in principle, be applied in many settings.

Introduction

DNA sequence data are accumulating at an amazingly fast pace. For example, GenBank, one of the largest sequence databases, had only 606 sequences and a total of 680,338 bases when it was first officially released in 1982. In contrast, its latest release in December 2012 contains more than 161 million sequences with over 148 billion bases. As shown in Figure 1, the number of bases stored in GenBank has increased exponentially since 1982, with a doubling time of about 18 months. Whole-genome sequencing (WGS) data, released independently of the regular GenBank updates, initially comprised 172,768 genomic sequences totaling over 692 million bases; the December 2012 release shows a stark increase to over 92 million sequences containing more than 356 billion bases. With the remarkable improvements in DNA-sequencing technology outpacing Moore's Law, tremendous challenges have arisen in all aspects of sequence data handling.
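As a rough sanity check on the quoted doubling time, the growth rate can be estimated from the two endpoints given above (680,338 bases in 1982 versus roughly 148 billion bases in December 2012). The short Python sketch below applies the standard doubling-time formula and yields a figure on the order of 20 months, broadly consistent with the ~18-month estimate; the 30-year span and the endpoint values are taken from the text, and everything else is illustrative.

```python
import math

# Back-of-the-envelope estimate of the GenBank doubling time from the
# two data points quoted above (roughly 30 years apart).
bases_1982 = 680_338          # first official GenBank release (1982)
bases_2012 = 148e9            # December 2012 release, >148 billion bases
span_months = 30 * 12

doubling_time = span_months * math.log(2) / math.log(bases_2012 / bases_1982)
print(f"approximate doubling time: {doubling_time:.1f} months")   # ~20 months
```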

Figure 1. The number of bases stored in GenBank from 1982 to the present doubles approximately every 18 months (blue line). The red line shows the number of bases from whole-genome sequencing (WGS) projects, which are released independently of the regular GenBank releases. Adapted from GenBank release statistics.

One immediate challenge is to store the massive sequence data in a compact manner and to transfer it from central sequence repositories or sequencing facilities to individual research labs efficiently and in a timely manner. Clearly, there is a pressing need for an efficient way to store biological data, one major type being DNA sequence data. As shown in Figure 2, disk storage per US dollar has increased exponentially, doubling about every 14 months, while the number of base pairs sequenced per US dollar also increased exponentially, with a doubling time of about 19 months until around 2004. From 2004 to the present, the sequencing doubling time has dropped drastically, to roughly every 5 months, owing to next-generation sequencing technology. Simply adding disk space is therefore no longer a viable answer to this data tsunami, and efficient sequence compression algorithms are needed now more than ever to reduce the scale of the problem.

Figure 2. Comparison of disk storage price with DNA sequencing cost. Blue squares denote disk storage capacity per US dollar (MB/$), showing exponential growth with a doubling time of about 36 months.
Red triangles denote DNA sequencing costs (base pairs/$), showing exponential growth with a doubling time of about 19 months (yellow line) over the earlier period shown, dropping to less than 6 months thereafter due to next-generation sequencing (NGS) technology. Adapted from the GB paper.

There has been much effort in developing efficient compression techniques for storing DNA sequences. Current compression programs that have been applied or developed for genetic data fall into two major categories: general-purpose compression and compression designed specifically for genetic data. Commonly used general-purpose compression programs include gzip and bzip2. Bzip2 uses the Burrows-Wheeler transform and is a lossless technique that compresses each file independently. Gzip uses the Deflate algorithm (a lossless method combining LZ77 and Huffman coding) and is readily available on machines running Unix/Linux operating systems. Examples of programs developed specifically to compress sequence files include DNACompress (1), GenCompress (2), and Quip (3). Programs designed for genetic data can be classified by the types of files they compress: some focus on FASTQ files, whereas others, such as SAMtools, compress mapped alignment results. Depending on whether information is lost, compression algorithms can also be classified as lossy or lossless; general-purpose tools are mostly lossless, while special-purpose tools can be either.

Recently, a few research groups, including ours, proposed reference-based compression algorithms for large-scale genomic data such as human genomes (4). The idea is to compress human genomes by encoding only their differences from a reference genome. The motivation comes from the fact that about 99.9% of any two human genomes are identical, so a delta (difference) representation that encodes the differences between two human genomes can be quite small. Although a reference sequence is required to recover the original sequences from the delta representations, a higher compression ratio is achieved by amortizing this cost over many genomes. For example, using the algorithm of Brandon et al. (5), a 433-fold compression can be achieved with an appropriate reference sequence for a data set of 3,615 mitochondrial genomic sequences, significantly better than previous work that compresses single genomic sequences. Similarly, the compression algorithm proposed by our group achieved a compression ratio of 98.8% for a data set of 5,473 mitochondrial genomes (6). However, despite the great advantage that reference-based compression algorithms have over direct compression of the entire sequence data, they are only applicable when a reference genome or sequence is readily available. In many large-scale sequencing projects, the reference genome is not known beforehand. For example, sequencing of the cow rumen metagenome produced about 280 billion base pairs of DNA, and these sequences may come from more than 400 microbes whose genomes are unknown (7).
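To make the delta-representation idea concrete before turning to the centroid approach, the sketch below encodes a sequence purely as its substitutions relative to an equal-length reference. It is an illustration only: it ignores insertions, deletions, and the entropy coding used by the published reference-based compressors cited above, and the sequences are made up for the example.

```python
def delta_encode(seq, reference):
    """Record only (position, base) pairs where seq differs from the reference.

    Assumes the two sequences have equal length (substitutions only); real
    reference-based compressors also handle indels and then entropy-code
    the resulting difference stream.
    """
    return [(i, b) for i, (a, b) in enumerate(zip(reference, seq)) if a != b]

def delta_decode(deltas, reference):
    """Rebuild the original sequence from the reference and the deltas."""
    rebuilt = list(reference)
    for pos, base in deltas:
        rebuilt[pos] = base
    return "".join(rebuilt)

reference = "ACGTACGTACGT"
genome    = "ACGTACCTACGA"             # differs from the reference at two positions
deltas = delta_encode(genome, reference)
print(deltas)                           # [(6, 'C'), (11, 'A')]
assert delta_decode(deltas, reference) == genome
```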
Thus, a general-purpose compression mechanism that also makes use of the reference-based compression idea would have advantages both over general-purpose programs, which do not fully exploit the features of biological data, and over the more specific programs, which tend to have limited applicability.
This project develops such a framework: a general-purpose compression pipeline that borrows the idea of reference-based compression and addresses the lack of reference sequences by introducing centroid sequences as references. The centroid-based compression pipeline includes four main procedures: clustering, centroid sequence construction, difference determination, and difference encoding. Data are first preprocessed and clustered into groups. For each group, a centroid sequence is constructed. Next, all sequences in a group are compared with their centroid sequence and the differences are determined. Finally, the centroids and differences are encoded with Huffman encoding. Our analyses show that this framework for a general-purpose compression pipeline is promising for situations in which no reference genome is available, while still retaining much of the advantage of reference-based compression algorithms.

Methods

Overview

Figure 3. The flowchart of the centroid-based compression pipeline.

The centroid-based compression pipeline, shown in Figure 3, can be divided into four stages: preprocessing and clustering, centroid sequence construction, difference determination, and difference encoding. First, a set (or database) of sequences is preprocessed and clustered into groups. Second, a centroid sequence is constructed for each group. Third, the differences between each sequence and the centroid sequence of its group are extracted. Finally, the differences are efficiently compressed using Huffman encoding.

Data preprocessing and clustering

Since the purpose of this work is to develop a framework for a general-purpose sequence compression pipeline, the most common sequence format, FASTA, is considered; other formats can easily be converted to FASTA. DNA sequences are used in the experiments, but the framework can easily be extended to protein sequence compression. As depicted in Figure 3, a FASTA ID mapper was written in Perl to extract the sequence ID and mapping information, which are stored separately in an ID-mapper file. Next, the reformatted sequence file is fed into the clustering program CD-HIT (8).
CD-HIT was chosen for incorporation into the pipeline for three reasons. First, CD-HIT is very fast at clustering. Second, CD-HIT can handle very large databases such as the NR database (the non-redundant nucleotide sequence database at NCBI). Third, CD-HIT is actually a suite of programs that perform a number of the tasks needed in the compression pipeline. CD-HIT is therefore an ideal tool around which to develop the framework for a general-purpose sequence compression pipeline.

CD-HIT uses a greedy incremental clustering algorithm following the approach of Holm and Sander (1998). Briefly, sequences are first sorted in order of decreasing length. The longest sequence becomes the representative of the first cluster. Each remaining sequence is then compared with the representative of each existing cluster: if its similarity to any representative is above a given threshold, it is assigned to that cluster; otherwise a new cluster is created with this sequence as its representative (a simplified sketch of this greedy strategy is given below). CD-HIT-454 is a special version of CD-HIT, created in 2010, designed to cluster artificial duplicate reads produced by 454 sequencing (9). It achieves this by allocating more memory to the search and by applying additional heuristics. The major difference between the two sister programs is that CD-HIT performs much better when clustering at lower sequence-similarity thresholds (0% to 70%), while CD-HIT-454 performs much better at higher thresholds (80% to 100%). It should be noted that clustering is typically the most time-consuming component of the pipeline.

Centroid sequence construction

Once the clusters are established, consensus sequences are constructed using CD-HIT's cdhit-cluster-consensus program, which aligns the sequences of each group with CLUSTALW (9, 10) and derives a consensus sequence from the resulting multiple sequence alignment. Briefly, the consensus is constructed as follows. If a column of the multiple sequence alignment contains only one distinct character, that character is used as the consensus character. For columns with conflicting characters, the quality scores of the bases are taken into account to compute an adjusted frequency for each character {A, C, G, T, - (gap)}; the dominant character is used as the consensus character if its adjusted frequency is >= 0.5, and N is used otherwise. The consensus sequence thus calculated is used as the centroid sequence of the group, with one exception: wherever the consensus contains an N, the column is re-examined and the most common non-N character is chosen instead, to save space. When a group contains only one sequence, the group is considered a singleton and that sequence is taken as the centroid of the group.
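As a concrete illustration of this consensus rule, the sketch below performs column-majority consensus calling over an already-computed multiple sequence alignment. It is not the cdhit-cluster-consensus implementation: in particular, it uses plain character counts in place of the quality-adjusted frequencies described above.

```python
from collections import Counter

def consensus_column(column):
    """One alignment column -> one consensus character.

    If the dominant character reaches a relative frequency of 0.5 it is used
    directly; otherwise (where the rule above would emit N) the most common
    non-N character in the column is substituted to save space.
    """
    counts = Counter(column)
    best, n = counts.most_common(1)[0]
    if best != "N" and n / len(column) >= 0.5:
        return best
    non_n = {c: k for c, k in counts.items() if c != "N"}
    return max(non_n, key=non_n.get) if non_n else "N"

def centroid_from_alignment(aligned_seqs):
    """Aligned, equal-length sequences over the alphabet {A, C, G, T, N, -}."""
    if len(aligned_seqs) == 1:         # singleton cluster: the read itself
        return aligned_seqs[0]         # serves as the centroid
    return "".join(consensus_column(col) for col in zip(*aligned_seqs))

# Example: three aligned reads from one cluster.
print(centroid_from_alignment(["ACGT-A", "ACGTTA", "ACCTTA"]))   # -> ACGTTA
```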
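The greedy, length-sorted clustering strategy described earlier can be illustrated in the same spirit. This is a didactic simplification rather than CD-HIT itself: CD-HIT avoids most pairwise comparisons with short-word filters, whereas here a naive placeholder identity function stands in for its similarity computation.

```python
def identity(a, b):
    """Placeholder similarity: fraction of matching positions over the shorter
    sequence.  CD-HIT instead uses short-word filtering and banded alignment."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def greedy_cluster(seqs, threshold=0.95):
    """Greedy incremental clustering in the style described above.

    Sequences are sorted by decreasing length; each sequence joins the first
    existing cluster whose representative it matches at or above the threshold,
    otherwise it founds a new cluster and becomes that cluster's representative.
    """
    clusters = []                          # each cluster: [representative, members...]
    for s in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])           # s starts a new cluster
    return clusters

# Example: two near-identical reads cluster together, the third is a singleton.
print(greedy_cluster(["ACGTACGTAA", "ACGTACGTAT", "TTTTGGGGCC"], threshold=0.9))
```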
Difference encoding and difference decoding

This step encodes the differences that each sequence has from its respective centroid sequence. Experiments examining the frequencies of A, T, G, and C in randomly selected NCBI sequences found A, T, and G to occur more often than C, so two bits are used to encode these three bases (A: 00, T: 01, G: 10), whereas three bits (110) are used to encode C. For consensus-sequence encoding, an additional END character, encoded as 111, marks the end of the consensus sequence. For correction characters, i.e. differences from the consensus, two further symbols are used: the gap character, encoded in four bits as 1111, and N, encoded as 1110.
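The consensus and correction alphabets above form two separate fixed prefix codes, so each can be decoded unambiguously by scanning the bits left to right. The sketch below illustrates that property; it keeps bits as character strings for readability and does not reproduce the pipeline's actual packed on-disk format, nor the sequence-name and relative-position encoding described in the next paragraph.

```python
# The two fixed prefix-code tables described above.
CONSENSUS_CODE = {"A": "00", "T": "01", "G": "10", "C": "110", "END": "111"}
CORRECTION_CODE = {"A": "00", "T": "01", "G": "10", "C": "110", "N": "1110", "-": "1111"}

def encode(symbols, code):
    """Concatenate the code words for a sequence of symbols."""
    return "".join(code[s] for s in symbols)

def decode(bits, code):
    """Prefix decoding: each table is prefix-free, so accumulating bits until
    they match a complete code word recovers the symbols unambiguously."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out

if __name__ == "__main__":
    consensus = list("ATGCCA") + ["END"]
    bits = encode(consensus, CONSENSUS_CODE)
    print(bits)                           # 00011011011000111
    print(decode(bits, CONSENSUS_CODE))   # ['A', 'T', 'G', 'C', 'C', 'A', 'END']
```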

Figure 4 gives an overview of the steps for extracting the differences of each individual sequence from its consensus sequence and encoding those differences; Figure 5 gives an overview of the decoding steps. In essence, a group of sequences is encoded by its consensus sequence together with the differences each member has from that consensus. The consensus sequence is first encoded with the consensus bit scheme, and the END character marks its end. Each member sequence (three in the example shown) is then encoded using the correction-character scheme, which is identical except for the two additional characters N and - representing the unknown and gap characters. The sequence name is binary encoded; then, for each correction, the position relative to the previous correction is binary encoded, followed by the correction type and the correction character itself. There are two types of corrections: in-place corrections (encoded as 0) and insertions (encoded as 1). When no further corrections are needed, a 0 marks the end of the corrections for that sequence.

Figure 4. An illustration of how differences from consensus sequences are encoded.

Results and discussion

Five datasets were used to examine the proposed centroid-based compression pipeline. The first consists of human genomic sequences obtained by Illumina sequencing: 22,012,277 reads of about 74 bases each. The other four datasets come from environmental metagenomic projects, where centroid-based compression seems particularly needed for compressing millions to billions of reads of unknown and diverse origin. The first metagenomic dataset was sampled from the Chesapeake Bay and comprises a quarter plate of 454 sequencing (254,852 reads at ~400 bases per read). The second was also generated from the Chesapeake Bay, by Sanger sequencing (20,140 reads at ~710 bases per read) (11).
The third dataset was sampled from Tampa Bay and comprises a quarter plate of 454 sequencing (208,244 reads at ~288 bases per read). The last metagenomic dataset was sampled from a deep-sea hydrothermal vent and comprises a full plate of 454 sequencing (700,278 reads at ~383 bases per read).

Figure 5. An illustration of how encoded sequences and differences are decoded.

For this pilot work, clustering efficiency and compression efficiency were the two main aspects examined. Experiments show that the clustering step is the most time-consuming part of the entire compression pipeline. Many heuristic clustering algorithms, such as USEARCH (12), have been proposed to improve clustering speed, and such programs could potentially be incorporated into the compression pipeline. Figure 6 shows the relationship between the similarity cutoff used by the CD-HIT clustering program and the percentage of reads included in non-singleton clusters (i.e., clusters of size >= 2). For all five datasets, the percentage of reads falling into clusters remains stable for similarity cutoffs between 80% and 98%, but drops abruptly once the cutoff exceeds 98%. The Tampa Bay metagenomic data have the highest percentage of reads falling into clusters (~88%); in contrast, the other four datasets have a low percentage of clustered reads, especially the Chesapeake Bay Sanger data and the Hydrothermal Vent 454 data. In these datasets most reads are singletons, i.e., the only read in their group, so centroid-based compression is not expected to save much space for them.

As noted above, clustering is the most time-consuming step of the entire pipeline, and one important parameter that influences its speed is the similarity cutoff threshold used by CD-HIT. Figure 7 shows the CPU time consumed by clustering the five datasets as a function of the similarity cutoff. For all datasets, the lower the cutoff, the more CPU time clustering takes. For example, with cutoffs between 80% and 90%, clustering the Chesapeake Bay 454 sequences took almost 5 hours to complete; the time dropped quickly as the cutoff increased beyond 90%, down to about 5 minutes at a cutoff of 93%.
For the Hydrothermal Vent sequences, clustering took about 100 hours for cutoffs between 80% and 90%; as the cutoff increased beyond 90%, the CPU time quickly dropped to around 20 minutes. However, although increasing the cutoff greatly speeds up the clustering step, a high similarity cutoff leaves many clusters as singletons, which reduces the advantage of the centroid-based compression algorithm. Thus, even though raising the cutoff is a tempting way to speed up the pipeline, the cutoff should be chosen carefully to achieve a good tradeoff between speed and compression performance.

Figure 6. The percentages of sequences that are included in non-singleton clusters.

As gzip is a commonly used data compression tool, readily available on Unix/Linux operating systems, it is used here as the baseline for comparison with the proposed centroid-based compression algorithm. It is worth noting that the centroid-based algorithm can itself be considered a general-purpose compression tool for sequences, since it can be applied to essentially any sequence dataset in FASTA format. Table 1 shows the compression performance of the centroid-based compression and gzip on the five datasets. The human Illumina data contain about 1.6 billion bases and take up roughly 1.6 billion bytes of disk space in their original form. Gzip reduces the file to a fraction of its original size, but the centroid compression pipeline achieves a better compression ratio and thus saves more disk space than gzip. The same pattern holds for the Chesapeake Bay 454 and Chesapeake Bay Sanger sequences. However, for the Tampa Bay and Hydrothermal Vent datasets, gzip saves more disk space and achieves better compression than the centroid method. This is mainly because, for the Tampa Bay sequences, most sequences turn out to be singletons, which reduces the advantage that centroid-based compression has over general-purpose compression. Future development should include separate compression strategies for singletons so that the maximum gain can be achieved by a mixture of algorithms.
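For the gzip side of such comparisons, the baseline is easy to reproduce: the sketch below compresses a file in memory with Python's gzip module and reports its size under both common conventions for "compression ratio" (compressed/original and original/compressed), since the convention used in Table 1 is not spelled out here. It is a generic measurement helper, not part of the published pipeline.

```python
import gzip
import sys

def gzip_ratio(path, level=9):
    """Compress a file with gzip in memory and report size statistics."""
    with open(path, "rb") as fh:
        raw = fh.read()
    packed = gzip.compress(raw, compresslevel=level)
    return {
        "original_bytes": len(raw),
        "gzip_bytes": len(packed),
        "compressed/original": len(packed) / len(raw),   # smaller is better
        "original/compressed": len(raw) / len(packed),   # fold reduction
    }

if __name__ == "__main__":
    for name, value in gzip_ratio(sys.argv[1]).items():
        print(f"{name}: {value}")
```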

Figure 7. The CPU time taken by the clustering step as a function of the identity threshold, i.e., the similarity cutoff. The left panel shows all datasets; the right panel shows a zoomed-in view for four of the datasets.

Table 1. Performance comparison of the centroid-based compression method with gzip.

Dataset                   Bases            Size (bytes)     gzip compression ratio   Centroid compression ratio
Partial Human             1,633,582,881    1,655,595,...
Chesapeake Bay (454)      101,826,...      ...,080,...
Chesapeake Bay (Sanger)   ...,312,931      14,561,...
Tampa Bay                 59,975,867       59,975,...
Deep Sea Hydrothermal     268,822,...      ...,522,...

Conclusions and future work

The rapid development of biological sequencing technology has created much excitement in the life sciences but, at the same time, has brought unprecedented challenges, ranging from the need for efficient mechanisms for data storage, transfer, and communication, through data visualization, processing, and analysis, to the presentation and interpretation of results.
For data storage, much work has gone into efficient algorithms for compressing biological data, among which delta compression, i.e., encoding only the differences from a reference sequence, performs best, as it greatly reduces the amount of data that must be stored. However, the method applies only where a reference sequence or genome is available or can readily be computed. The current work develops a framework that extends the idea of delta compression and addresses situations in which no reference sequences are available. It is, in effect, a general-purpose sequence compression framework that can be used to compress any sequence database or dataset. Future work includes extensive study of various aspects of the compression pipeline, such as parallelized versions of the clustering algorithms to speed up compression and more efficient coding mechanisms, such as arithmetic coding (13), for the consensus sequences and differences.

Acknowledgements

This project is funded by NSF Award No. OCI.

References

1. X. Chen, M. Li, B. Ma, J. Tromp, DNACompress: fast and effective DNA sequence compression. Bioinformatics 18, 1696 (2002).
2. X. Chen, S. Kwong, M. Li, A compression algorithm for DNA sequences. IEEE Engineering in Medicine and Biology Magazine 20, 61 (2001).
3. D. C. Jones, W. L. Ruzzo, X. Peng, M. G. Katze, Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research 40, e171 (2012).
4. M. H. Y. Fritz, R. Leinonen, G. Cochrane, E. Birney, Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research 21, 734 (2011).
5. M. C. Brandon, D. C. Wallace, P. Baldi, Data structures and compression algorithms for genomic sequence data. Bioinformatics 25, 1731 (2009).
6. L. S. Heath, A. P. Hou, H. Xia, L. Zhang, paper presented at the Proc. LSS Comput. Syst. Bioinform. Conf.
7. M. Hess et al., Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331, 463 (2011).
8. W. Li, A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658 (2006).
9. B. Niu, L. Fu, S. Sun, W. Li, Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 11, 187 (2010).
10. J. D. Thompson, D. G. Higgins, T. J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673 (1994).
11. S. R. Bench et al., Metagenomic characterization of Chesapeake Bay virioplankton. Applied and Environmental Microbiology 73, 7629 (2007).
12. R. C. Edgar, Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460 (2010).
13. I. H. Witten, R. M. Neal, J. G. Cleary, Arithmetic coding for data compression. Communications of the ACM 30, 520 (1987).