DNA Base Data Hiding Algorithm Mohammad Reza Abbasy, Pourya Nikfard, Ali Ordi, and Mohammad Reza Najaf Torkaman

DNA Base Data Hiding Algorithm Mohammad Reza Abbasy, Pourya Nikfard, Ali Ordi, and Mohammad Reza Najaf Torkaman Advanced Informatics School (AIS), International Campus, Universiti Teknologi Malaysia (UTM), Kuala Lumpur. {ramohammad2, npourya2, oali2, rntmohammad2}@live.utm.my ABSTRACT Proposing an algorithm by using software point of view for the purpose of implementing data hiding based on DNA sequences to increase the complexity is the main target of this paper. The implementation of algorithm is applicable because of some interesting characteristics of DNA sequences. Both of DNA s features and Binary Coding Technology beside Complementary Pairing Rules are explained through the paper. Since, both of a secret message (M) and DNA reference sequence is needed. Data hiding is started by applying three different and separate steps to prepare M. The receiver will apply the process of identifying and extracting the original message (M) which had been hidden in DNA reference sequence. As the security of the algorithm is crucial part, it will appear in security issue s section. KEYWORDS DNA sequence; data hiding; DNA base pairing rules; complementary pairing rules; DNA binary coding. 1 INTRODUCTION In order to protect data through the unsecure networks like the Internet, using various types of data protection is necessary. One of the famous ways to protect data through the Internet is data hiding. Because of the increasing number of Internet users, utilizing data hiding or Steganographic techniques is inevitable. Eliminating the role of the intruder and authorizing the receiver, are eventual goals of these techniques. Thus, the role of data hiding has become more eminent nowadays. Before employing biological properties of DNA sequences, usually embedding a secret message into the host images was the traditional way of data hiding [1, 2, 3, 4, and 5]. Unfortunately, this had some liabilities. The most important ones was the detection of the distortions of the image when the host image changed to some degrees. This spot was the best spot to start the wholly detection of the secret message through the image. With advent of biological aspects of DNA sequences to the computing areas, new data hiding methods have been proposed by researchers, based on DNA sequences [6, 7, 8, 9, and 10]. The key portion of their work is, utilizing biological characteristics of DNA sequences. In order to convert binary data into amino acids as a DNA sequence, the base pairing rules must be used. Synthesizing nucleotides in real environment (biology) is done in constant rules: 183

Purine Adenine (A) always pairs with the pyrimidine Thymine (T) Pyrimidine Cytosine (C) always pairs with the purine Guanine (G) Always, those rules are done naturally because the opportunities to synthesize hydrogen bonds between A and T (two bonds), and also between C and G (three bonds) is different, basically (hydrogen bonds have been shown with dotted lines in Fig1). These concepts are named Watson-Crick base pairing rules when they discovered DNA s fundamental structure as a Noble prize [13]. Figure 1. Synthesizing basic nucleotides naturally In binary computing area, it is possible to change the natural rules by own decision. For example, in biology A is synthesized to T while we can assume A to C or A to G, and so on, as we prefer. Increasing the complexity of the algorithm is the main purpose of the changing the rules. In this paper, the authors consider A=00, T=01, C=10, and G=11 to convert binary message to DNA sequences. A way to increase the complexity is complementary pair rule. Complementary pair rule is a unique equivalent pair which is assigned to every nucleotides base pair. As an example, complementary rule is applied on strand in below [16]: Complementary rule: ((AC) (CG) (GT) (TA)) DNA strand: AATGC Applying complementary rule on DNA strand: CCATG Increasing the complexity is the main purpose of using those rules, in this paper. It means that, finding the original message by intruder needs extra calculations because there are four basic alphabets therefore four likelihood of complementary rule for every DNA sequences. So, the final number of possible those rules are 4 3 2 1=24. On the other hand, the possibility to happen a correct guess is. Extra information and findings for some basic definitions can be obtained in molecular biology reference books [11, 12, and 13]. 2 RELATED WORKS For the purpose of clarifying the algorithm in current paper, introducing some backgrounds of the knowledge is necessary [8, 11, 12, and 13]. The most important part of each DNA base data hiding algorithm is, manipulating four letters which has been called as nucleotides in biology. The letters are A, C, G, and T. Any composition from them will make a sequence. For instance, two DNA sequences have been appeared in [16]. They mentioned sequences from 184

European Bioinformatics Institute (which is known as EBI Database) [15] for the purpose of extracting DNA sequences of Litmus and Balsaminaceae. So, Litmus with 154 nucleotides and Balsaminaceae with 2283 are shown in below, respectively: Litmus: ATCGAATTCGCGCTGAGTCACAA TTCGCGCTGAGTCACAATTCGCGC TGAGTCACAATTGTGACTCAGCCG CGAATTCCTGCAGCCCCGAATTCC GCATTGCAGAGATAATTGTATTTA AGTGCCTGCTCGATACAATAAACG CCATTTGACC. Balsaminaceae: TTTTTATTATTTTTTTTCATTTTTT TCTCAGTTTTTAGCACATATCATT ACATTTTATTTTTTCATTACTTCTA TCATTCTATCTATAAAATCGATTA TTTTTATCACTTATTTTTCTAATTT CCATATTTCATCTAATGATTATATT ACATTAAAGAAATCG. Even though data can be shown by DNA nucleotides, but representing to binary was emerged in 1999 by Rauhe and et al. [17]. They represented numbers by using binary DNA sequences in figure 1. The creativity of their work was in how they could separate binary sequences from each other. All binary sequences are illustrated in form of s{0 1}e. An arbitrary DNA bits has been concatenated between two terminations, s and e, as start and end, respectively. Annealing and ligation are the way of concatenating DNA bits between s and e. Both of terminations and DNA bits have been made by annealing complementary oligonucleotides. In order to concatenate on both sides, having sticky ends (A, Ā, X, Ȳ) are necessary. The sticky end A(Ā) acts as a variable for correct concatenation of bits and terminations. For subsequent cloning, the sticky ends X and Ȳ should be used. In 2000 Leier and et al. [8] brought a robust technique by utilizing a special key strand. They called it, primer. Primer has key role to decrypt a coded strand. Using a public DNA strand was utilized as a reference sequence in their DNA based encryption technique. In this scheme, the receiver must also be informed about the reference sequence. Namely, the receiver will receive a selected primer and an encrypted strand. The intruder is not able to decrypt the binary data without knowing about both of primer and reference strand, certainly. A primer is a complementary subset from a sort of DNA strand. Normally, the primer is called a short substring. For example, assume S is a DNA strand: Figure 2. Assembly of DNA binary strands [8]. S= ATGCTTAGTTCCATCGGAGACTAATGGCCTA and two primers ATCAA and GATTAC. So ATCAA and GATTAC are complementary substring of TAGTT and CTAATG, correspondingly. Definitely, a complementary rule is needed to handle the manipulations, correctly. For this reason, they defined a 185

complementary rule which is A-T, T-A, C-G, and finally G-C. Accomplishing those states in biology is done by a series of chemical mechanisms to combine primers with a DNA strand. Indicating the right position of the primers also is shown by a fluorescent chemical substance. When hybridization occurs among primers and substrings of the reference DNA reference sequence, it will become bright, obviously. Except bright places, remaining are dim sections. The exact message of bright portion is binary data 1 and naturally, a dim portion is referred to the binary data 0. So, according to the above, the proper output of the hybridization is: Therefore, the final secret message in form of binary is 01010. If the sender prefers to complicate that proposed technique, it can send one part of primers to the legitimate receiver. For the purpose of recovering primers, the receiver must apply PCR. PCR is another chemical scheme which has been known as Polymerase Chain Reaction, scientifically. It can recover primers correctly [16]. Considering a certain substring from DNA sequence as a character was proposed by Peterson in 2001 [9]. By substituting three successive nucleotides as a character, he could hide data in DNA sequence, appropriately. For instance, A =GGC, B =ATG, and so on. So, for this scheme we face to 64 symbols that can be possibly encrypted. The main liability of this method is the frequency for both of E and I in an English message. Because of those holes, an intruder can apply a cryptanalysis technique base on frequencies of the most repetitive letters in English and subsequently extracts the secret message, simply. The next scheme in 2002 [18] needs some backgrounds before explaining. An arrangement of nucleotides determines a protein. The responsibility of the proteins is almost every activity in cells. Transcription is the process by which RNA is created, an intermediary copy of the instruction contained in DNA. Naturally, RNA has four bases. They are adenine (A), cytosine (C), uracil (U) and guanine (G). The RNA copy is related to mrna (abstract of messenger RNA) by throwing away the coming among sequences of RNA. On mrna, each codon has three nucleotides for the purpose of representing which amino acid must be assigned in the next position. All different amino acids have been listed in Table 1. They are: Phe, Leu, Ile, Val, Ser, Pro, Thr, Ala, Tyr, His, Gln, Asn, Lys, Asp, Glu, Cys, Trp, Arg, Met and Gly. Each codon binds a set of three nucleotides into an anticodon which is called trna molecule (transfer RNA). The responsibility of the trna is, translating nucleic acid into protein. Totally, there are forty separate trna molecules which each one has a binding for one of amino acids, as represented in Table 1. 186

Apposite trna is bound to codon on the mrna. When the STOP codon was seen, the translation is completed, and then a protein is released. Because of the codon redundancy, Shimanovsky et al. [18] utilized from that feature to hide data in mrna. In general, each mrna codon has been composed of three nucleotides. The likely nucleotides are: U, C, A, and G. Therefore, there are many different probable combinations to form each mrna codon, whereas in Table 1, there are only twenty different amino acids. It is because of some codons might be mapped to the same amino acids. For instance, the codons UUA, CUU, CUA and UUG are mapped to the same amino acid Leu. 3 PROPOSED METHOD This section is divided into two phases. The first one is, embedding and the second one is, extracting the original message. At the end of the method, a snapshot of the project will appear. 3.1 Phase1: Embedding Secret Message In order to explain embedding phase, separating the phases into some successive and vivid sub-phases, is the best way of proposing current method. In below, sub-phases have been shown, respectively. Feasibility of embedding information in the mrna codon is because of the redundancy. For example, if the codon must be encrypted with UUA, while the secret message if four, it is possible to use the codon UUG in order to replace the original. It is feasible because UUG is the fourth codon of all the codons whose mapping amino acid is Leu. Even though previous replacement does not have any effect on the results of transcription, but it modifies the nucleotides of the original sequence, that might probably trigger unidentified consequences. Although the previous beginning schemes utilize distinct biological features for hiding data in DNA sequences, they are not cost-effective and efficient to employ. In this paper, we apply a scheme to hide data in DNA strands based on a software point of view to improve them. M= Message Convert Binary to DNA Nucleotides M =DNA Sequence from M Applying Complementary Rules on M M = New Form of M Finding Index of each Couple of Nucleotides in DNA Reference Sequence M =Secret Message Figure 3. Phase 1: embedding secret message Obviously, there is an original message M which the sender decides to send via a network to another client who is called receiver. So, there are three sub-phases to provide the final form of M which is M and send it to receiver. The first sub-phase is, converting by DNA base pairing rules. The product is M. M contains nucleotides sequences. By applying DNA base pairing rules, the 187

message can convert from binary to DNA sequence. Not only DNA base pairing helps to encrypt the message from binary to DNA sequence but also it is applied to decrypt the secret message to original one, truly. The next (second) sub-phase is, applying complementary rules. Increasing the complexity is the real and exact purpose of this step. By applying the complementary rules, the new form of the M which is M emerges. Now, M is appeared. As mentioned before, both of sender and receiver have a DNA reference sequence from a large number of possibilities base on EBI [14] or NCBI [15] database. It means that, they have selected the same DNA reference sequence, exactly. The exact role of the third sub-phase is, extracting the index of each couple nucleotides in DNA reference sequence, numerically. When all the indexes have been extracted, M has been made, properly. M is precisely the secret message with some changes through the embedding phase. Now, sender can send the message (M ) to receiver. Clarification of the current phase is continued by demonstrating an example, step by step. In this example, assume original message M=100111000011 should be sent to the receiver. DNA Reference Sequence: AT 1 CG 2 AA 3 TT 4 CG 5 CG 6 CT 7 GA 8 GT 9 C A 10 CA 11 AT 12 TC 13 GC 14 GC 15 TG 16 AG 17 T G 18 AA 19 CC 20 M=100111000011 Sub-phase 1 (A= 00, T= 01, C= 10, G= 11): M = CTGAAG Sub-phashe 2 ((AC) (CG) (GT) (TA)): M = GATCCT Sub-phase 3 (Indexes): M =8137 Now, embedding phase is finally completed. Then, sender sends 8,13,7 to the receiver. In the next section, the receiver will apply the extracting phase for extracting the original message by using three consecutive phases. 3.2 Phase2: Extracting Original Message Now, receiver takes the secret message in form of some numbers. For the purpose of extracting the original message from DNA reference sequence, phase two with its sub-phases will extract the original message, correctly. M =Secret Message Finding Index of each Couple of Nucleotides in DNA Reference Sequence M = Previous Form of M Applying Complementary Rules on M M =DNA Sequence from M Convert Binary to DNA Nucleotides M= Message Figure 4. Phase2: extracting original message So, the first sub-phase manipulates the M. Because of the nature of the secret message which is some sorts of numbers (exact positions (indexes) of the original message on DNA reference sequence), extracting the message starts by finding the indexes on DNA reference sequence one by one according to the numbers which sender has sent in form of the current secret message. M is the exact product of the first sub-phase. Consequently, the second sub-phase applies complementary rules on M in order to extracting M, correctly. The importance of the M is the form of it. M is the last form of message, based on DNA nucleotides. Converting the M to the M is the third sub-phase. 188

Transforming from DNA nucleotides to the binary is the responsibility of the last sub-phase. Now, the receiver has truly extracted the original message M. Those steps are demonstrated through the example in below: DNA Reference Sequence: AT 1 CG 2 AA 3 TT 4 CG 5 CG 6 CT 7 GA 8 GT 9 C A 10 CA 11 AT 12 TC 13 GC 14 GC 15 TG 16 AG 17 T G 18 AA 19 CC 20 M =8137 Sub-phase 1 (Indexes): M = GATCCT Sub-phase 2 ((AC) (CG) (GT) (TA)): M = CTGAAG Sub-phase 3 (A= 00, T= 01, C= 10, G= 11): M=100111000011 So, the receiver extracted the original message, accurately by using a simple algorithm. In the next section, security and liabilities of the algorithm will inspect, briefly. 3.3 Presenting Results The final picture of the design phases of the prototype is presenting results. The GUI contains two separated parts. The first (left) is, embedding secret message and second (right) is, extracting secret message. Through the each part, there are some sub-phases to complete the intention of the parts. In the left side, in order to complete the embedment, first, original message decoded to the binary. The second step is, converting binary message to the DNA strands by rules: A=00, T=01, C=10 and G=11. The results of this step can be seen in first output textbox. As we mentioned, we use complementary rules to increase the complexity by DNA basic synthetic rules which it says only A-C, C-G, G-T and T-A can have a chance to synthesize together. Then, DNA reference sequence is seen which both of Alice and Bob have shared between themselves. The usage of DNA reference sequence is, finding the index of each couple of nucleotide. At the end of the embedment phase, the secret message can be seen in group box of the secret message. Indexes are represented by numbers. Then Alice clicks the button send to Bob, for sending the secret message to Bob. Now, extraction phase is started. In side of extraction, Bob must extract the secret message. So, Bob clicks the extraction button and the software starts to recover the secret message. First, indexes are found through the DNA reference sequence (which had shared between Alice and Bob) and corresponding to each index, its couple nucleotide is extracted. The output is a DNA sequence. By applying complementary rules, the previous DNA sequence is converted to the right DNA sequence after binary coding in embedment phase. After that, for each nucleotide, the corresponding binary symbol put to the binary original message textbox. As we know, every eight bits represents a character. So, by separating all the eight bits, the final characters will appear correctly. 4 SECURITY ISSUES In terms of security, each intruder must be aware from the following information, correctly. Without this fundamental information, possibility of 189

extracting original message is near to zero, scientifically. They are: DNA reference sequence: there are 163 million DNA reference sequence on EBI database. Therefore, the likelihood of making a doing well conjecture by attacker is. Binary coding rule: as mentioned, the sender is free to select any equivalent binary form for every nucleotide. It means that, A can be 00, 01, 10, or 11 ; C can be 00, and so on. In other words, all the binary coding rules are 4 3 2 1=24. So, the likelihood of making correct guess by attacker is. Complementary pairing rule: like binary coding rule, there is 4 3 2 1=24 complementary alphabet among basic nucleotides. Therefore, the possibility of making successful attack is. Eventually, the final probability of making a correct and successful guess by attacker is. 5 CONCLUSION At the end of the project, it is sensible which data hiding with DNA sequences is a potential and promising way of hiding secret message within DNA sequences. Different type of environment to manipulate the data was the best part of the project. It means that, regardless the basic concepts of Steganography (robustness, capacity, visibility, and performance), utilizing biology concepts could bring new properties to solve ancient difficulties of algorithm such as complexity besides performance. At the beginning of the project, the most important question which reader thought about it was how it is possible to convert binary data to biology and genetics area? As we showed, DNA coding technology helped us to transform binary bits to DNA nucleotides well. Considering DNA characteristics brings new ideas in data hiding. DNA sequences are potential to implement new data hiding techniques or even transforming previous schemes to new one. In this paper, a reference DNA sequence has been shared between sender and receiver. Not only this DNA reference sequence can be retrieved from EBI [15] or NCBI [14] databases but it can also be simply selected from any database. Therefore, by considering any sort of database, there are 163 million targets to select it. Virtually, guessing the correct DNA sequence by attacker is unachievable. The crucial feature of the DNA sequences is visibility. Finding secret message in a DNA sequence is difficult because the visibility of the sequences is very low. Therefore, attacker cannot find out whether this sequence is a fake or not. In comparison with previous techniques, such as in images, not only implementing this method is not difficult but also it is formidable to detect, as well. 190

6 REFERENCES 1. C.C. Chang, C.C. Lin, C.S. Tseng, W.L. Tai, Reversible hiding in DCT-based compressed images, Information Sciences 177 (2007) 2. C.C. Chang, W.C. Wu, Y.H. Chen, Joint coding and embedding techniques for multimedia images, Information Sciences 178 (2008) 3. C.H. Huang, J.L. Wu, Fidelity-guaranteed robustness enhancement of blind-detection watermarking schemes, Information Sciences 179 (2009) 4. H.H. Tsai, D.W. Sun, Color image watermark extraction based on support vector machines, Information Sciences 177 (2007) 5. H.W. Tseng, C.P. Hsieh, Prediction-based reversible data hiding, Information Sciences 179 (2009) 6. C.C. Chang, T.C. Lu, Y.F. Chang, R.C.T. Lee, Reversible data hiding schemes for deoxyribonucleic acid (DNA) medium, International Journal of Innovative Computing, Information and Control 3 (2007) 7. C.T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots, Nature 399 (1999) 8. A. Leier, C. Richter, W. Banzhaf, H. Rauhe, Cryptography with DNA binary strands, BioSystems 57 (2000) 9. I. Peterson, Hiding in DNA, Muse (2001) 10.B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002) 11.B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, J.D. Watson, Molecular Biology of the Cell, Garland Publishing, New York & London, (1994) 12. A.L. Lehninger, D.L. Nelson, M.M. Cox, Principles of Biochemistry, Worth, New York, (2000) 13.J. Watson, N. Hopkins, J. Roberts, J. Steitz, Molecular Biology of the Gene, fourth ed., Benjamin Cummings, Menlo Park, CA, (1987) 14. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/ 15. European Bioinformatics Institute, http://www.ebi.ac.uk/ 16. H.J. Shiu, K.L. Ng, J.F. Fang, R.C.T. Lee, C.H. Huang, Information Sciences, Volume 180, Issue 11, 1 June( 2010) 17. Rauhe, H., Vopper, G., Banzhaf, W., Howard, J.C., (1999) 18. B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, in: Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science 2578 (2002) 191

Table 1. The mapping of codon to amino acid [18]. UUU Phe UCU Ser UAU Tyr UGU Cys UUA Leu UCA Ser UAA Stop UGA Stop CUU Leu CCU Pro CAU His CGU Arg CUA Leu CCA Pro CAA Gln CGA Arg AUU Ile ACU Thr AAU Asn AGU Ser AUA Ile ACA Thr AAA Lys AGA Arg GUU Val GCU Ala GAU Asp GGU Gly GUA Val GCA Ala GAA Glu GGA Gly UUC Phe UCC Ser UAC Tyr UGC Cys UUG Leu UCG Ser UAG Stop UGG Trp CUC Leu CCC Pro CAC His CGC Arg CUG Leu CCG Pro CAG Gln CGG Arg AUG Ile ACC Thr AAC Asn AGC Ser AUG Start ACG Thr AAG Lys AGG Arg GUC Val GCC Ala GAC Asp GGC Gly GUG Val GCG Ala GAG Glu GGG Gly Figure 5. Screenshot of Prototype GUI Showing Embedment and Extraction phases. 192