1. Introduction
As the cost of cloud computing falls, cloud computing technology is becoming more mature. In this context, users no longer keep their genetic data locally but increasingly store the data and related services on cloud servers. This practice carries certain risks, however. The genetic data can be directly exposed to external attackers and to the staff of cloud service providers, so a user's private biological information may be leaked or abused [1]. To improve the security of stored biological data, most servers currently use encryption schemes to store and manage it. Meanwhile, with the development of science and technology, the amount of data generated every day keeps increasing, and the information related to people's lives is growing explosively [2, 3].
The structure of this paper is as follows. The first part is the introduction and the review of the references. The second part presents the research methods and materials, mainly the homomorphic encryption algorithm and the principle of gene information coding and matching. The third part is the experiment, which compares the proposed method with traditional encryption algorithms. The last part is the conclusion, which summarizes the results of the experiment.
2. Materials and Methods
2.1 Fully Homomorphic Encryption Algorithm
To improve the security of gene matching based on cloud computing, this paper uses a homomorphic encryption algorithm to encrypt cloud-based genes. First, the encryption principle and steps of the homomorphic encryption algorithm are analyzed in detail. Then, the overall structure of homomorphic encryption is analyzed. Finally, homomorphic encryption is applied to gene matching based on cloud computing. The detailed analysis is as follows.
2.1.1 Principle of the fully homomorphic encryption algorithm
Let the encryption function be [TeX:] $$E_K,$$ the decryption function be [TeX:] $$D_K,$$ the operation be α, and the plaintext be [TeX:] $$M=\left(m_0, m_1, m_2, \ldots, m_i\right).$$ Then the calculation formula for the fully homomorphic encryption algorithm is:
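In its standard form, and assuming that the operation α is applied to the ciphertexts on the left-hand side and to the plaintexts on the right-hand side, the homomorphic property can be written as follows (a sketch in the notation defined above, with subscripted [TeX:] $$E_K$$ and [TeX:] $$D_K$$ denoting the encryption and decryption functions):

[TeX:] $$D_K\left(\alpha\left(E_K\left(m_0\right), E_K\left(m_1\right), \ldots, E_K\left(m_i\right)\right)\right)=\alpha\left(m_0, m_1, \ldots, m_i\right)$$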
A homomorphic encryption scheme consists of four algorithms: the key part, the encryption part, the decryption part, and the additional evaluation part [4, 5]. The four parts are described as follows:
· Key part (Key): The public key pk and the private key sk are generated according to the given parameter γ.
· Encryption part (Enc): Encrypt the plaintext M with the public key pk to obtain the encrypted ciphertext C.
· Decryption part (Dec): Decrypt the ciphertext C with the private key sk to obtain the plaintext M.
· Evaluation part (Evaluate): Takes as input the public key pk, a circuit C with t inputs (composed of mod-2 addition gates and multiplication gates), and the ciphertexts c corresponding to the plaintext inputs [6, 7]. Its output is a new ciphertext, written [TeX:] $$\text{Evaluate}(pk, C, c).$$
2.1.2 Steps of the fully homomorphic encryption algorithm
Before the fully homomorphic encryption algorithm is used, a somewhat homomorphic scheme needs to be constructed in which the modulo-2 operation is changed to a modulo-4 operation, so that 2 bits of plaintext can be encrypted at a time [8]. Let λ be the security parameter. The construction of the somewhat homomorphic scheme is as follows.
- KeyGen(λ): An η-bit key p is generated from the security parameter λ.
- [TeX:] $$\text{Encrypt}(sk, m)$$: Encrypting [TeX:] $$m \in\{00,01,10,11\}$$ yields c = m + 4r + pq, where r is a randomly generated ρ-bit integer, ρ is the noise length, and q is a randomly generated γ-bit integer.
- [TeX:] $$\text{Decrypt}(sk, c)$$: m = (c mod p) mod 4.
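The three steps above can be sketched as a toy Python implementation. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the bit lengths `RHO` and `GAMMA` are arbitrary small values chosen here, the key is forced to be odd with its top bit set, and the noise is kept non-negative so that Python's `%` operator yields m + 4r directly. Parameters of this size offer no real security.

```python
import secrets

ETA = 64     # bit length of the secret key p (eta in the text)
RHO = 16     # assumed noise length in bits (rho in the text)
GAMMA = 128  # assumed bit length of the random multiplier q


def keygen(eta: int = ETA) -> int:
    """Return a random odd eta-bit integer p used as the secret key."""
    return secrets.randbits(eta) | (1 << (eta - 1)) | 1


def encrypt(p: int, m: int, rho: int = RHO, gamma: int = GAMMA) -> int:
    """Encrypt a 2-bit message m in {0, 1, 2, 3} as c = m + 4r + pq."""
    if m not in (0, 1, 2, 3):
        raise ValueError("m must be a 2-bit value")
    r = secrets.randbits(rho)    # fresh noise
    q = secrets.randbits(gamma)  # random multiple of the key
    return m + 4 * r + p * q


def decrypt(p: int, c: int) -> int:
    """Recover m = (c mod p) mod 4; correct while the noise m + 4r stays below p."""
    return (c % p) % 4


if __name__ == "__main__":
    key = keygen()
    for message in (0b00, 0b01, 0b10, 0b11):
        ciphertext = encrypt(key, message)
        print(format(message, "02b"), "->", format(decrypt(key, ciphertext), "02b"))
```

With these toy parameters the fresh noise m + 4r is below 2^18, far smaller than the 64-bit key, so every fresh ciphertext decrypts correctly.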
The value of c mod p is the noise, that is, decryption is correct only when [TeX:] $$m+4 r\lt \frac{p}{2},$$ in which case [TeX:] $$c \bmod p=m+4 r.$$ Under the chosen security parameters, a "fresh" ciphertext always decrypts to the correct plaintext [9]. In the above process, KeyGen(λ) is the key generation algorithm, [TeX:] $$\text{Encrypt}(sk, m)$$ is the encryption algorithm, and [TeX:] $$\text{Decrypt}(sk, c)$$ is the decryption algorithm. The homomorphism of the algorithm is verified next. The specific verification process is as follows:
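A sketch of this verification, assuming two fresh ciphertexts [TeX:] $$c_1=m_1+4 r_1+p q_1$$ and [TeX:] $$c_2=m_2+4 r_2+p q_2$$ (the subscripted symbols are introduced here for illustration), expands the sum and the product directly:

[TeX:] $$c_1+c_2=\left(m_1+m_2\right)+4\left(r_1+r_2\right)+p\left(q_1+q_2\right)$$

[TeX:] $$c_1 c_2=m_1 m_2+4\left(m_1 r_2+m_2 r_1+4 r_1 r_2\right)+p\left(m_1 q_2+m_2 q_1+4 r_1 q_2+4 r_2 q_1+p q_1 q_2\right)$$

In both cases the result has the form "message + 4 × noise + p × integer". As long as the new noise term stays within the decryption bound, reducing modulo p and then modulo 4 yields [TeX:] $$\left(m_1+m_2\right) \bmod 4$$ and [TeX:] $$\left(m_1 m_2\right) \bmod 4,$$ respectively.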
The above formula indicates that, when the ciphertexts involved are "fresh," ciphertext addition and multiplication satisfy the homomorphic property. However, the noise grows with each successive operation [10].
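The effect of noise growth can be observed with a small experiment. The sketch below is illustrative only and rests on assumed toy parameters (a 64-bit key, 16-bit noise, 128-bit multiplier, non-negative noise); the helper names `noise_bits` and `noise_growth` are hypothetical. It adds and multiplies ciphertexts directly and records how many bits the noise c mod p occupies after each multiplication.

```python
import secrets


def keygen(eta: int = 64) -> int:
    """Random odd eta-bit secret key p."""
    return secrets.randbits(eta) | (1 << (eta - 1)) | 1


def encrypt(p: int, m: int, rho: int = 16, gamma: int = 128) -> int:
    """c = m + 4r + pq with a rho-bit noise r and a gamma-bit multiplier q."""
    return m + 4 * secrets.randbits(rho) + p * secrets.randbits(gamma)


def decrypt(p: int, c: int) -> int:
    """m = (c mod p) mod 4."""
    return (c % p) % 4


def noise_bits(p: int, c: int) -> int:
    """Bit length of the noise c mod p carried by a ciphertext."""
    return (c % p).bit_length()


def noise_growth(p: int, levels: int = 4) -> list:
    """Multiply by a fresh encryption of 1 at each level and record the noise size."""
    c = encrypt(p, 1)
    history = [noise_bits(p, c)]
    for _ in range(levels):
        c = c * encrypt(p, 1)
        history.append(noise_bits(p, c))
    return history


if __name__ == "__main__":
    key = keygen()
    c1, c2 = encrypt(key, 2), encrypt(key, 3)
    print("sum    :", decrypt(key, c1 + c2))  # (2 + 3) mod 4 = 1
    print("product:", decrypt(key, c1 * c2))  # (2 * 3) mod 4 = 2
    print("noise bits per level:", noise_growth(key))
```

Each multiplication roughly adds the bit lengths of the two noise terms, so with these parameters the noise is guaranteed to stay below the key for two successive multiplications; beyond that it may wrap around modulo p, after which decryption is no longer reliable.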
The steps of the full homomorphic encryption algorithm are as follows:
- KeyGen(λ): An η-bit private key p is randomly generated. Let [TeX:] $$x_0=p q_0,$$ where [TeX:] $$x_0$$ is an odd number and [TeX:] $$r_p\left(x_0\right)$$ is divisible by 4. Following the somewhat homomorphic scheme, [TeX:] $$2 \sqrt{\tau}$$ encryptions of 0 are generated: [TeX:] $$x_{i, b}=p q_{i, b}+4 r_{i, b}, \quad b \in\{0,1\}, \quad 1 \leq i \leq \sqrt{\tau}.$$ The final public key size is [TeX:] $$2 \sqrt{\tau},$$ and [TeX:] $$p k=\left\langle x_0, x_{1,0}, x_{1,1}, x_{2,0}, x_{2,1}, \ldots, x_{\sqrt{\tau}, 0}, x_{\sqrt{\tau}, 1}\right\rangle.$$
- [TeX:] $$\text{Encrypt}(pk, m)$$: A τ-dimensional vector [TeX:] $$b=\left\langle b_{i, j}\right\rangle\left(1 \leq i, j \leq \sqrt{\tau}, b_{i, j} \in\{0,1\}\right)$$ and a large prime number q are randomly generated (the bit length of p is greater than that of q). For plaintext [TeX:] $$m \in\{00,01,10,11\},$$ the expression of ciphertext c is:
- [TeX:] $$\text{Decrypt}(sk, c)$$: Decrypting the ciphertext gives the plaintext [TeX:] $$m=(c \bmod p) \bmod 4.$$ The purpose of reducing modulo [TeX:] $$x_0$$ in the encryption process is to reduce the ciphertext size.
2.2 Cloud Computing-Based Fully Homomorphic Encryption Algorithm for Gene Matching
The overall flow of the cloud computing-based fully homomorphic encryption algorithm for gene matching is shown in Fig. 1.
In Fig. 1, the input of the algorithm is divided into two parts. The first part collects the gene sequence input by the user, and the second part reads the gene sequences stored in the cloud file. The two parts correspond to the gene sequence to be searched and the database gene sequences, respectively. By default, the search gene is a specific gene of a certain length. If the specified length range is 50, the length of the gene to be searched also lies within this range. The gene to be searched is encoded and encrypted, and it is decrypted after being transmitted to the terminal. The degree of agreement between the decrypted result and the genes in the gene pool is then analyzed to find the most compatible gene. The output shows the final matching result. The various parts of the algorithm flow are analyzed in detail below.
Fig. 1. Flow chart of the fully homomorphic encryption algorithm for gene matching.
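The comparison step at the end of this flow can be sketched in Python. The example below is a hedged illustration, not the procedure used in this paper: the 2-bit code in `ENCODING`, the agreement score (fraction of identical bit positions), and the function names are all assumptions introduced here, and the comparison is shown on already decrypted data.

```python
# Assumed 2-bit code for the four bases (hypothetical; chosen so that
# complementary bases have complementary bit patterns).
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(sequence: str) -> str:
    """Encode a base sequence as a binary string, 2 bits per base."""
    return "".join(ENCODING[base] for base in sequence.upper())


def agreement(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length bit strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(query: str, database: dict) -> tuple:
    """Slide the encoded query over every database gene and return
    (score, gene name, base offset) of the window with the highest agreement."""
    if not query:
        raise ValueError("query must not be empty")
    q = encode(query)
    best = (0.0, None, -1)
    for name, gene in database.items():
        d = encode(gene)
        # Step by 2 bits so that windows stay aligned to whole bases.
        for start in range(0, len(d) - len(q) + 1, 2):
            score = agreement(q, d[start:start + len(q)])
            if score > best[0]:
                best = (score, name, start // 2)
    return best


if __name__ == "__main__":
    gene_pool = {"g1": "ACGTACGT", "g2": "TTGACCAG"}
    print(best_match("GACC", gene_pool))
```

A score of 1.0 means the query occurs exactly in the reported gene at the reported offset; lower scores indicate a partial match.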
3. Results and Discussion
3.1 Encryption Decryption Effect
A MATLAB program is used to collect the gene sequence input by the user and to read the codebook generated from the gene sequences stored in the cloud file. The security parameters of the system are set to [TeX:] $$\gamma = 4, \eta = 64, \tau = 1028.$$ The experiment is conducted on a VMware Workstation 9.0.0 build-812388 virtual machine with Ubuntu 12.10 as the operating system. This platform has a 20.3 GB hard drive and 772 MB of memory. The dataset used in this experiment comes from the open data provided by the biological information databases of the DNA Data Bank of Japan, the European Bioinformatics Institute, and the National Center for Biotechnology Information. It mainly consists of serialized data of DNA molecules, ranging in length from 1,950 billion to 595 million. Personal genetic data segments are selected as the basic dataset of the experiment, and fragments of them are used as the experimental data.
Through analysis, an XOR operation is performed on each possible combination of the four bases to obtain a truth table (Table 1). Table 1 shows that the XOR results of A and T, and of C and G, are all 1, while no other combination is guaranteed to give a result of all 1s. The simulation results of encryption and decryption are shown in Fig. 2.
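The XOR property described for Table 1 can be reproduced with a short Python sketch. The 2-bit code below is an assumption made for illustration (the paper's own code table is not reproduced here); it is one assignment under which the complementary pairs A/T and C/G XOR to all 1s.

```python
from itertools import combinations_with_replacement

# Assumed 2-bit code for the bases (hypothetical assignment).
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def xor_table() -> dict:
    """Return the XOR of the 2-bit codes for every unordered pair of bases."""
    return {
        (x, y): format(CODE[x] ^ CODE[y], "02b")
        for x, y in combinations_with_replacement("ACGT", 2)
    }


if __name__ == "__main__":
    table = xor_table()
    for (x, y), bits in table.items():
        print(f"{x} XOR {y} = {bits}")
    # Pairs whose XOR is all 1s under the assumed code.
    print([pair for pair, bits in table.items() if bits == "11"])
```

Under this assignment only the pairs (A, T) and (C, G) produce 11; identical bases produce 00, and every other pair produces a mixed result.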
Fig. 2 shows that, after the results of the encryption process are collected, the obtained ciphertext is 011000110011110100100100001010110111110101100010010110000001. The decrypted plaintext is obtained in the same way. A comparison of the original plaintext binary sequence with the decrypted binary sequence shows that the two are almost identical, which indicates that the encryption and decryption process of the algorithm is accurate. Meanwhile, the security analysis of the encryption and decryption process shows that only the recipient and the sender hold the DNA codebook under the algorithm. Therefore, an attacker cannot select the key from the codebook, which guarantees the reliability of the key, and the attacker's chance of a successful brute-force attack is small.
Fig. 2. Analysis of encryption and decryption effect: (a) key simulation results and (b) ciphertext decryption results.
3.2 Encryption Efficiency
The proposed algorithm is applied to encrypt the cloud-based genes that need to be matched, and the time taken to encrypt the gene files is recorded for different numbers of gene file fragments. The overall acceleration ratio of the encryption process is also calculated. The encryption efficiency of the proposed algorithm is compared with that of gene matching based on a symmetric encryption algorithm and gene matching based on an asymmetric encryption algorithm. The comparison results are shown in Table 2.
Table 2 indicates that, for the same algorithm and the same file size, the encryption time gradually shortens and the overall acceleration ratio gradually increases as the number of gene files increases. When the number of fragments of the gene file is the same, the proposed algorithm has the shortest encryption time and a relatively large acceleration ratio. When the number of 1 GB gene files is 65, the proposed algorithm has its shortest encryption time and its maximum overall acceleration ratio: the encryption time is as short as 80.13 ms, and the overall acceleration ratio is up to 3.6. In other words, the proposed algorithm encrypts faster, which can significantly improve the speed of cloud-based gene matching. This is because the homomorphic encryption algorithm designed in this paper converts gene information into binary values and encrypts 2 bits at a time, which greatly accelerates the encryption process and improves the encryption efficiency.
Table 2. Statistical results for different numbers of file slices
3.3 Actual Matching Effect
To verify the gene matching effect of the algorithm in the cloud computing environment, a large number of genes are used for experimental analysis. The results of the analysis are shown in Fig. 3.
Fig. 3. Analysis of matching results: (a) original gene chain, (b) original gene chain DNA sequence, and (c) match result.
A comparison of Fig. 3(a) with Fig. 3(c) shows that the original gene chains all appear in the output gene chains, indicating that the algorithm can match a part of an original gene chain to the complete gene chain in the cloud computing environment. A comparison of the matching results with the actual results in Fig. 3(b) shows that the genes matched by the algorithm are consistent with the actual genes, indicating that the algorithm has high matching accuracy. This is also because the homomorphic encryption algorithm in this paper simplifies the expression of ciphertext by using binary values and thereby achieves high-precision matching.
3.4 Matching Performance
To evaluate the gene matching performance of the algorithm in the cloud computing environment, the time, energy consumption, and accuracy of gene matching are analyzed. The three algorithms are compared in terms of the time and energy consumption of gene matching for different gene sequence sizes, and the results are shown in Table 3.
Table 3. Results of time and energy consumption analysis of gene matching
Table 3 shows that, as the size of the gene sequence increases, the matching time of all three algorithms also increases. For the same gene sequence size, the gene matching time and energy consumption of the proposed algorithm are the lowest.
As shown in Fig. 4, the proposed method is finally compared on the dataset with the traditional homomorphic encryption algorithm, the A* algorithm, and the regularized homomorphic encryption algorithm. As the number of characters retrieved by matching increases, the matching time of all methods increases. However, when the number of characters is between 600 and 5,400, the proposed method is the least time-consuming.
Fig. 4. Matching time versus the number of characters under different algorithms.
5. Conclusion
To improve the security and accuracy of gene matching, this paper studies a cloud-based homomorphic encryption algorithm for gene matching. The encrypted gene is decrypted once it has been transmitted to the matching terminal. The decrypted gene is then compared with the genes in the database, binary sequence against binary sequence, to achieve gene matching. Experimental analysis shows that the algorithm can significantly improve the security of gene matching. Meanwhile, the algorithm has shorter encryption and decryption times in the gene matching process, which can greatly reduce the gene matching time. Practical tests show that the gene matching effect is better and the matching accuracy is higher under the algorithm. This paper still has some limitations, and future research should address two aspects: first, a distributed computing model will be studied to further improve the efficiency of fully homomorphic encryption computation; second, the integration of different privacy protection principles, anonymization technology, and homomorphic encryption technology will be studied further.