
Plaintext attack on joint transform correlation encryption system by convolutional neural network


Abstract

The image encryption system based on joint transform correlation (JTC) has attracted much attention because its ciphertext contains no complex values and decryption does not require strict pixel alignment of the ciphertext. This paper shows that the JTC architecture is vulnerable to attack by a deep learning method, the convolutional neural network (CNN). Given a large number of ciphertexts and their corresponding plaintexts, the CNN can simulate the key of the encryption system. Unlike traditional methods, which use phase retrieval algorithms to recover or estimate the optical encryption key, the key model trained in this paper converts ciphertext directly into the corresponding plaintext. Compared with existing neural network systems, this paper uses the sigmoid activation function and adds dropout layers, making the computation of the neural network faster and more accurate, and the equivalent key trained by the network shows a degree of robustness. Computer simulations prove the feasibility and effectiveness of this method.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Recently, in order to prevent important information from being exploited by criminals, much attention has been paid to information security, and a large number of effective information security systems have been proposed. Since Refregier and Javidi proposed the double random phase encoding (DRPE) method [1], optical image encryption technology has developed rapidly. Various optical cryptography systems have been proposed [2–11], such as DRPE in the Fresnel domain [3] and the fractional Fourier domain [4,5], and optical image encryption based on diffractive imaging and interference [6]. Compared with traditional DRPE, the joint transform correlation (JTC) architecture is more secure and more convenient [11]. The ciphertext of this system is a power spectrum, which need not be recorded holographically, nor does decryption require the complex conjugate of the phase key on the Fourier plane; this effectively avoids the trouble of producing the complex conjugate of the decryption key. At the same time, moving the key on the input plane only shifts the position of the decrypted image on the output plane without influencing its quality, which greatly enhances the practicality of the method.

The concept of the neural network was proposed as early as the last century, but constrained by computer hardware, it was difficult to apply in practice. In recent years, with the improvement of computer hardware and the discovery that GPUs can accelerate neural network computation, neural networks have developed rapidly again, and many neural network systems have been proposed [12–18]. Among them, convolutional neural networks (CNNs) can extract and learn features from a target image, and are widely used in image classification [15,16], image reconstruction [17,18] and other fields. Neural networks have also been applied to attack optical image encryption systems. Hai et al. attacked DRPE through neural networks [19], and image encryption systems based on beam interference, diffraction and the principles of computational holography have also been attacked with neural networks [20–23]. Compared with previous attack methods [24–30], neural networks have a clear advantage: they are limited neither by complex phase retrieval algorithms nor by the difficulty of retrieving optical encryption keys. Optical image encryption methods and their attacks develop together; research on attacks against optical encryption systems helps people find their loopholes, improve the encryption schemes, and ultimately explore new encryption methods.

In this paper, a CNN is used to attack the JTC architecture for the first time. Compared with the existing literature [19–23], the neural network proposed here adopts the sigmoid activation function and dropout layers, which makes the attack more efficient and saves attack time: with the same amount of training data the training time is shorter, fewer training epochs achieve good convergence, and the noise resistance is better. Barrera et al. attacked the JTC architecture using traditional phase retrieval algorithms and chosen-plaintext attacks [28]; in the same year, they attacked the JTC encryption system with a known-plaintext attack [29]. The CNN-based attack proposed in this paper yields better quality than both systems: the decrypted image has a high peak signal-to-noise ratio (PSNR) and almost no visible noise. Zhang et al. attacked the JTC encryption system [30] through a ciphertext-only attack with an iterative algorithm, which must run 800 iterations to decrypt each plaintext. In this paper, by contrast, the trained CNN directly constructs an equivalent key between ciphertext and plaintext and acts as the key to decrypt the ciphertext directly, so no additional iterations are needed for any decryption, which means the system is completely broken. In optical image encryption, noise caused by diffraction is often unavoidable, but the influence of noise on JTC attack methods has not been discussed before. The CNN designed in this paper can restore noisy ciphertexts and ciphertexts with local damage. It does not need to introduce additional phase terms to reconstruct the keys; it directly finds the transformation law between plaintext and ciphertext to construct an equivalent key. In both method and efficiency, it improves on existing attack systems.

2. JTC architecture

Let us briefly review the JTC architecture. As shown in Fig. 1, the image $f(x )$ and the input random phase mask $\alpha (x )$ are overlapped and placed at position $x = a$ of the input plane, while the key mask $h(x )$ is placed at position $x = b$. The intensity distribution of the Fourier transform of the input function, i.e., the joint transform power spectrum, can be expressed as:

$$\begin{aligned} E(u) &= {|{FT[{\alpha (x - a)f({x - a} )+ h({x - b} )} ]} |^2}\\ & = {|{A(u) \ast F(u)} |^2} + 1 + {[{A(u) \ast F(u)} ]^\ast }H(u )\\ &\quad \times \exp [{ - i2\pi ({b - a} )u} ]+ [{A(u )\ast F(u )} ]{H^\ast }(u )\\ &\quad \times \exp [{ - i2\pi ({a - b} )u} ], \end{aligned}$$
where $FT[\cdot]$ denotes the Fourier transform; $A(u )$, $F(u )$ and $H(u )$ denote the Fourier transforms of $\alpha (x )$, $f(x )$ and $h(x )$, respectively; and $\ast$ denotes convolution, while the superscript $^\ast$ denotes complex conjugation. For decryption, the key is placed at $x = b$ on the input plane; after Fourier transformation it is multiplied by the ciphertext in the frequency domain, and an inverse Fourier transform yields the decrypted image:
$$\begin{aligned}f^{\prime}(x )&= h(x )\ast [{\alpha (x )f(x )} ]\otimes [{\alpha (x )f(x )} ]\ast \delta ({x - b} )\\ &\quad + h(x )\ast \delta ({x - b} )+ h(x )\ast h(x )\otimes [{\alpha (x )f(x )} ]\\ &\quad \ast \delta ({x - 2b - a} )+ \alpha (x )f(x )\ast \delta ({x - a} ), \end{aligned}$$
where ${\otimes}$ denotes correlation, $\delta$ denotes the Dirac delta function, and the last term is the decrypted plaintext image.
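For intuition, the encryption step of Eq. (1) can be simulated numerically. The following NumPy sketch is illustrative only: the window positions, the zero-padding, and the random masks are our own choices rather than parameters specified in this paper (although with 28×28 inputs the 3× padded plane does yield 84×84 ciphertexts, matching the sizes used in Section 3).

import numpy as np

rng = np.random.default_rng(0)

def jtc_encrypt(plaintext):
    # Simulate the joint power spectrum of Eq. (1). `plaintext` is a 2-D
    # array scaled to [0, 1]; the two windows play the roles of the
    # positions x = a and x = b on the input plane.
    n = plaintext.shape[0]
    plane = np.zeros((3 * n, 3 * n), dtype=complex)
    alpha = np.exp(2j * np.pi * rng.random((n, n)))  # input phase mask alpha(x)
    key = np.exp(2j * np.pi * rng.random((n, n)))    # key phase mask h(x)
    plane[n:2 * n, 0:n] = plaintext * alpha          # alpha(x - a) f(x - a)
    plane[n:2 * n, 2 * n:3 * n] = key                # h(x - b)
    # The ciphertext is the squared modulus of the joint spectrum (real-valued).
    return np.abs(np.fft.fft2(plane)) ** 2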

Fig. 1. JTC encryption system.

The same key is used for encryption and decryption, which reduces the difficulty of optical implementation. The unique encryption structure and optical characteristics of the JTC architecture make it unnecessary to align each optical element as precisely as in traditional DRPE: even if the key is not placed at its designated position on the input plane, the quality of the decrypted image is unaffected; only the position of the decrypted image changes. In addition, the encryption result is the intensity of the joint power spectrum of the input image, which is convenient to output and store. All of this brings great convenience to encryption and decryption.

3. CNN architecture

Next, we build the proposed CNN. Neural networks deal with two kinds of problems: classification and regression. Classical convolutional architectures such as VGG16 [15] and the fully convolutional network (FCN) [12] are used for image classification and segmentation. The attack on the optical image encryption system studied in this paper uses a neural network to solve a regression problem. Unlike the usual regression problem, whose final result is a single value, the method in this paper deals with a multi-value regression problem: image features are extracted by the CNN, and then pixel-by-pixel regression is carried out against the plaintext of the encrypted image. The designed CNN is presented in Fig. 2.

Fig. 2. CNN architecture.

First, the input ciphertext passes through a convolution layer with a 3${\times}$3 kernel. Although a large kernel can cover more pixels at one time, stacked small kernels extract features better: two 3${\times}$3 kernels have the same receptive field as one 5${\times}$5 kernel, yet the large kernel is slower to compute and has more parameters. Therefore small kernels are usually used, which require fewer parameters and produce more features. In the convolutional layer, the activation function is the sigmoid function:

$$S(x )= \frac{1}{{1 + {e^{ - x}}}},$$
its derivative is:
$${S^{\prime}}(x )= \frac{{{e^{ - x}}}}{{{{({1 + {e^{ - x}}} )}^2}}} = S(x )[{1 - S(x )} ].$$
The primary role of the activation function is to give the network nonlinear modeling ability. Without an activation function, the network can only express a linear mapping; even with many hidden layers, the whole network would be equivalent to a single-layer network. It can therefore be said that only after an activation function is added does the neural network acquire hierarchical nonlinear mapping ability. The sigmoid is the most widely used activation function, with the shape of an exponential function, and is closest to biological neurons in the physical sense.
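As a quick numerical check of Eqs. (3) and (4) (a minimal NumPy sketch, not part of the simulations in this paper), the identity $S^{\prime}(x) = S(x)[1 - S(x)]$ can be verified against a finite-difference estimate:

import numpy as np

def sigmoid(x):
    # Sigmoid activation, Eq. (3).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # Derivative via the identity of Eq. (4): S'(x) = S(x)[1 - S(x)].
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6, 6, 101)
numeric = np.gradient(sigmoid(x), x)  # finite-difference derivative
assert np.allclose(sigmoid_derivative(x), numeric, atol=1e-2)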

As shown in Fig. 3(a), the sigmoid function is continuous, smooth and strictly monotonic, making it a very good threshold function. The derivative of the sigmoid can be expressed through the function itself, as shown in Fig. 3(b), so it is very convenient and fast to compute. As an activation function the sigmoid also has defects: it may cause exploding or vanishing gradients in networks with many layers. However, the network in this paper has few layers, so this is not a concern. After the convolution layer, the image enters the pooling layer. Here we use max pooling, which retains the maximum value of each 2${\times}$2 pixel block and discards the others, halving the size of the image. Pooling further highlights the image features and reduces the computational complexity of the network. Then comes the dropout layer, which randomly discards a fraction of the neurons during computation. The dropout rate is set to 0.2, so 20% of the neurons are discarded in the actual calculation. Dropout effectively prevents the network from overfitting and accelerates computation. L1 and L2 regularization can also prevent overfitting, but dropout is sufficient for the network designed in this paper. After another convolution, pooling and dropout stage, the data is flattened into a vector and connected to the output layer in a fully connected way. We choose the mean squared error (MSE) as the loss function:

$$MSE = \frac{1}{n}\sum\limits_{i = 1}^n {{{({{y_i} - y_i^{\prime}} )}^2}} ,$$
where ${y_i}$ denotes the training label and $y_i^{\prime}$ the predicted value. During training, the parameters of the neural network are updated continually so that the loss function keeps decreasing, yielding a model with higher accuracy. The optimizer is Adaptive Moment Estimation (Adam) [31], which maintains exponentially decaying averages of past gradients and past squared gradients and adapts the update of each parameter; ${m_t}$ denotes the biased first moment estimate and ${v_t}$ the biased second moment estimate:
$${m_t} = {\beta _1}{m_{t - 1}} + ({1 - {\beta_1}} ){g_t},$$
$${v_t} = {\beta _2}{v_{t - 1}} + ({1 - {\beta_2}} )g_t^2.$$
where ${\beta _1}$ and ${\beta _2}$ denote the exponential decay rates of the first- and second-order moment estimates, and ${g_t}$ denotes the current gradient. Because the moment estimates are initialized at zero, they are biased toward zero, so bias-corrected estimates are used:
$$\widehat m = \frac{{{m_t}}}{{1 - \beta _1^t}},$$
$$\widehat v = \frac{{{v_t}}}{{1 - \beta _2^t}},$$
$${\Theta _{t + 1}} = {\Theta _t} - \frac{\eta }{{\sqrt {\widehat v} + \epsilon }}\widehat m.$$

Fig. 3. The activation function and its derivative used in the convolution layer. (a) Sigmoid function, (b) derivative of the sigmoid function.

where $\widehat m$ and $\widehat v$ denote the bias-corrected ${m_t}$ and ${v_t}$, ${\Theta _t}$ denotes the parameters to be updated in the neural network, $\eta$ denotes the learning rate, which controls the step size, and $\epsilon$ is a small constant that stabilizes the computation. Adam combines the advantages of other optimizers and enables the network to reach a good result in fewer training iterations.
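Equations (6)–(10) amount to a single update step per parameter. The following NumPy sketch uses the common default hyperparameters from [31]; they are our assumption, since this paper does not state its learning-rate settings.

import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update, Eqs. (6)-(10); t counts steps starting from 1.
    m = beta1 * m + (1 - beta1) * grad        # biased first moment, Eq. (6)
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second moment, Eq. (7)
    m_hat = m / (1 - beta1 ** t)              # bias correction, Eq. (8)
    v_hat = v / (1 - beta2 ** t)              # bias correction, Eq. (9)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (10)
    return theta, m, v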

After building the neural network, data can be imported for training and testing, as shown in Fig. 4. Here we use the MNIST database of handwritten digit images [32] and take five thousand images as training samples. We first encrypt them with the JTC architecture to obtain ciphertexts, and then feed the 84${\times}$84 ciphertexts into the input layer. (If high-resolution images were used for training, the training time would be correspondingly longer.) After the first convolution layer with 25 filters there are 25 channels; because the filter size is 3${\times}$3 and edge padding is not used, each channel is 82${\times}$82, so the first convolution yields a 25${\times}$82${\times}$82 feature volume. After max pooling, each channel shrinks to half its size, giving 25${\times}$41${\times}$41. After another convolution layer with 50 filters the volume becomes 50${\times}$39${\times}$39, after max pooling 50${\times}$19${\times}$19, and finally it is flattened into a 1${\times}$18050 vector, which is fully connected to a 1${\times}$784 output vector: since each training label is 28${\times}$28 pixels, it has 784 pixels after flattening. The CNN in this paper is trained by supervised learning. We take the plaintext as the training label, repeatedly compute the mean squared error between the prediction of the output layer and the label, and update the parameters of the network so that the error becomes smaller and smaller. After training, the CNN can be used to decrypt ciphertexts.
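A minimal Keras sketch of this architecture follows. The layer sizes match the description above; the output-layer activation and the batch size are illustrative assumptions that the paper does not state.

from tensorflow.keras import layers, models

def build_attack_cnn():
    # CNN of Fig. 2: 84x84 ciphertext in, flattened 28x28 plaintext out.
    model = models.Sequential([
        layers.Input(shape=(84, 84, 1)),                  # ciphertext
        layers.Conv2D(25, (3, 3), activation='sigmoid'),  # -> (82, 82, 25)
        layers.MaxPooling2D((2, 2)),                      # -> (41, 41, 25)
        layers.Dropout(0.2),
        layers.Conv2D(50, (3, 3), activation='sigmoid'),  # -> (39, 39, 50)
        layers.MaxPooling2D((2, 2)),                      # -> (19, 19, 50)
        layers.Dropout(0.2),
        layers.Flatten(),                                 # -> 18050
        layers.Dense(784, activation='sigmoid'),          # output activation assumed
    ])
    model.compile(optimizer='adam', loss='mse')           # Eq. (5) and Adam [31]
    return model

# x_train: (5000, 84, 84, 1) ciphertexts; y_train: (5000, 784) plaintext labels
# model = build_attack_cnn()
# model.fit(x_train, y_train, epochs=50, batch_size=32)   # batch size assumed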

Fig. 4. Schematic diagram of the attack scheme using the CNN.

4. Result analysis

We use the Keras deep learning framework with a TensorFlow backend for training; the computer is equipped with 32 GB of RAM and an NVIDIA GeForce RTX 2080 GPU. From the MNIST database, 5000 images were selected and trained for 50 epochs, and the loss function converged well, as shown in Fig. 5. The whole training process took only two and a half hours. We now use this trained CNN to attack ciphertexts.

Fig. 5. Convergence curve of the CNN during training.

We select images from the dataset that were not used for training to test the system. The results of the attack are shown in Fig. 6: Figs. 6(a)–(d) are the original images, and Figs. 6(e)–(h) are the ciphertexts after JTC encryption. As the figure shows, the encrypted images are unrecognizable to the human eye. Figures 6(i)–(l) show the images recovered by the CNN attack: the ciphertexts are restored to clear, recognizable images without using the decryption keys. To quantify the quality of the decrypted images after the CNN attack, we calculate their PSNRs. The PSNRs of Figs. 6(i)–(l) are 28.61 dB, 32.78 dB, 30.59 dB and 27.75 dB, respectively. Compared with previous JTC attack methods [28,29], the recovered plaintext has better quality.
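The PSNR used here is the standard definition; a short sketch (assuming 8-bit images with a peak value of 255) is:

import numpy as np

def psnr(reference, decrypted, peak=255.0):
    # Peak signal-to-noise ratio in dB between plaintext and attack result.
    mse = np.mean((reference.astype(np.float64) - decrypted.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)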

Fig. 6. Decryption of the ciphertexts by the trained CNN. (a)–(d) plaintext images, (e)–(h) ciphertext images, (i)–(l) decrypted images.

For generality, we also test the CNN with the Fashion-MNIST database [33], a data set of various clothes, shoes and hats from daily life. The selected test images were likewise not used for training. Figures 7(a)–(d) are the original images, Figs. 7(e)–(h) are the ciphertexts after JTC encryption, and Figs. 7(i)–(l) are the images recovered by the CNN attack. The PSNRs of Figs. 7(i)–(l) are 22.92 dB, 19.98 dB, 19.38 dB and 23.55 dB, respectively. Although the decrypted images lose some detail, the overall outline features are still well restored.

Fig. 7. The neural network decrypts images from the Fashion-MNIST database. (a)–(d) plaintext images, (e)–(h) ciphertext images, (i)–(l) decrypted images.

To test the robustness of the neural network, we cover up a quarter and a third of each ciphertext before importing it into the CNN. Figures 8(a)–(d) show ciphertexts with a quarter covered, and Figs. 8(e)–(h) the plaintexts recovered by the CNN; Figs. 8(i)–(l) show ciphertexts with a third covered, and Figs. 8(m)–(p) the recovered plaintexts. The PSNRs of Figs. 8(e)–(h) are 19.84 dB, 20.88 dB, 20.37 dB and 20.06 dB, and those of Figs. 8(m)–(p) are 19.69 dB, 19.66 dB, 18.87 dB and 19.31 dB, respectively. Even if part of a ciphertext is missing, the neural network in this paper can still extract enough information from it to recover the plaintext.
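A minimal sketch of this occlusion test follows; which region of the ciphertext is covered, and the cover value, are our illustrative assumptions.

import numpy as np

def occlude(ciphertext, fraction):
    # Zero out a corner block whose area is `fraction` of the image.
    occluded = ciphertext.copy()
    h, w = occluded.shape
    side = np.sqrt(fraction)   # block side length as a fraction of each edge
    occluded[:int(h * side), :int(w * side)] = 0.0
    return occluded

# masked = occlude(ciphertext, 1 / 4)  # cover a quarter of the pixels
# recovered = model.predict(masked[None, :, :, None]).reshape(28, 28)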

Fig. 8. Simulation results when parts of the ciphertexts are missing. (a)–(d) ciphertexts with a quarter covered, (e)–(h) the corresponding CNN attack results, (i)–(l) ciphertexts with a third covered, (m)–(p) the corresponding CNN attack results.

Furthermore, we add salt-and-pepper noise and Gaussian noise to the ciphertexts and observe the attack results. Figures 9(a)–(d) are ciphertexts with salt-and-pepper noise added, and Figs. 9(e)–(h) the plaintexts recovered by the CNN; Figs. 9(i)–(l) are ciphertexts with Gaussian noise added, and Figs. 9(m)–(p) the recovered plaintexts. The PSNRs of Figs. 9(e)–(h) are 23.79 dB, 23.02 dB, 23.55 dB and 24.98 dB, and those of Figs. 9(m)–(p) are 21.91 dB, 26.57 dB, 24.34 dB and 24.98 dB, respectively. Noise evidently has little effect on the predictions of the proposed CNN.
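Both noise models are standard; a sketch follows (the noise levels are illustrative assumptions, as the paper does not state them).

import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=0.05):
    # Additive zero-mean Gaussian noise on an image scaled to [0, 1].
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_salt_and_pepper(img, density=0.05):
    # Flip a random fraction of pixels to 0 (pepper) or 1 (salt).
    noisy = img.copy()
    mask = rng.random(img.shape)
    noisy[mask < density / 2] = 0.0       # pepper
    noisy[mask > 1 - density / 2] = 1.0   # salt
    return noisy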

Fig. 9. Neural network attacks on noisy ciphertexts. (a)–(d) ciphertexts with salt-and-pepper noise added, (e)–(h) the corresponding CNN attack results, (i)–(l) ciphertexts with Gaussian noise added, (m)–(p) the corresponding CNN attack results.

If a previously published neural network is used directly, it cannot achieve comparable results at the same training level (the same amount of data, training time and number of epochs as in this paper); the results are shown in Fig. 10. The proposed method therefore performs better under the same conditions.

Fig. 10. Results of the JTC attack using different neural networks. (a) ciphertext image, (b) decrypted image using the method proposed in this paper, (c) plaintext image, (d) decrypted image using a previous method.

5. Conclusion

In this paper, a CNN was proposed through which ciphertext can be decrypted into plaintext directly. The network finds the transformation law between plaintext and ciphertext and constructs an equivalent key, so no additional iterations are needed for any decryption, which means the JTC system is completely broken. The design adopts the sigmoid activation function and dropout layers, which makes the attack more efficient. The network is also robust: even when the ciphertext is partially cut or polluted by noise, it can still be restored to plaintext. Computer simulations prove the feasibility and effectiveness of the method. Using a neural network to attack an optical encryption system is a novel approach that can promote the continuous improvement of encryption systems and open up a new avenue of cryptanalysis.

Funding

National Natural Science Foundation of China (61505046); Natural Science Foundation of Zhejiang Province (LY19A040010).

Acknowledgments

The computing software was supported by Mathematics Center of Hangzhou Dianzi University.

Disclosures

The authors declare no conflicts of interest.

References

1. P. Refregier and B. Javidi, “Optical image encryption based on input plane and Fourier plane random encoding,” Opt. Lett. 20(7), 767–769 (1995). [CrossRef]  

2. B. Javidi, “Securing information with optical technologies,” Phys. Today 50(3), 27–32 (1997). [CrossRef]  

3. G. Situ and J. Zhang, “Double random-phase encoding in the Fresnel domain,” Opt. Lett. 29(14), 1584–1586 (2004). [CrossRef]  

4. G. Unnikrishnan, J. Joseph, and K. Singh, “Optical encryption by double-random phase encoding in the fractional Fourier domain,” Opt. Lett. 25(12), 887–889 (2000). [CrossRef]  

5. Z. Liu and S. Liu, “Random fractional Fourier transform,” Opt. Lett. 32(15), 2088–2090 (2007). [CrossRef]  

6. Y. Zhang and B. Wang, “Optical image encryption based on interference,” Opt. Lett. 33(21), 2443–2445 (2008). [CrossRef]  

7. L. Sui, M. Xin, and A. Tian, “Multiple-image encryption based on phase mask multiplexing in fractional Fourier transform domain,” Opt. Lett. 38(11), 1996–1998 (2013). [CrossRef]  

8. Z. Liu, H. Chen, W. Blondel, Z. Shen, and S. Liu, “Image security based on iterative random phase encoding in expanded fractional Fourier transform domains,” Opt. Lasers Eng. 105, 1–5 (2018). [CrossRef]  

9. J. Wu, B. Haobogedewude, Z. Liu, and S. Liu, “Optical secure image verification system based on ghost imaging,” Opt. Commun. 399, 98–103 (2017). [CrossRef]  

10. J. Barrera, A. Mira, and R. Torroba, “Optical encryption and QR codes: secure and noise- free information retrieval,” Opt. Express 21(5), 5373–5378 (2013). [CrossRef]  

11. T. Nomura and B. Javidi, “Optical encryption using a joint transform correlator architecture,” Opt. Eng. 39(8), 2031–2035 (2000). [CrossRef]  

12. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 3431–3440 (2015).

13. H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of IEEE International Conference on Computer Vision, IEEE, 1520–1528 (2015).

14. Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual U-Net,” IEEE Geosci. Remote Sens. Lett. 15(5), 749–753 (2018).

15. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 (2014).

16. T. H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “PCANet: A simple deep learning baseline for image classification,” IEEE Trans. Image Process. 24(12), 5017–5032 (2015).

17. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 770–778 (2016).

18. K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 3929–3938 (2017).

19. H. Hai, S. Pan, M. Liao, D. Lu, W. He, and X. Peng, “Cryptanalysis of random phase encoding based optical cryptosystem via deep learning,” Opt. Express 27(15), 21204–21213 (2019). [CrossRef]  

20. L. Zhou, Y. Xiao, and W. Chen, “Machine-learning attacks on interference-based optical encryption: experimental demonstration,” Opt. Express 27(18), 26143–26154 (2019). [CrossRef]  

21. L. Zhou, Y. Xiao, and W. Chen, “Learning-based attacks for detecting the vulnerability of computer-generated hologram based optical encryption,” Opt. Express 28(2), 2499–2510 (2020). [CrossRef]  

22. L. Zhou, Y. Xiao, and W. Chen, “Vulnerability to machine learning attacks of optical encryption based on diffractive imaging,” Opt. Lasers Eng. 125, 105858 (2020). [CrossRef]  

23. Y. Qin, H. Wan, and Q. Gong, “Learning-based chosen-plaintext attack on diffractive-imaging-based encryption scheme,” Opt. Lasers Eng. 127, 105979 (2020). [CrossRef]  

24. X. Peng, H. Wei, and P. Zhang, “Chosen-plaintext attack on lensless double-random phase encoding in the Fresnel domain,” Opt. Lett. 31(22), 3261–3263 (2006). [CrossRef]  

25. X. Peng, P. Zhang, H. Wei, and B. Yu, “Known-plaintext attack on optical encryption based on double random phase keys,” Opt. Lett. 31(8), 1044–1046 (2006). [CrossRef]  

26. X. Wang and Z. Dao, “Amplitude-phase retrieval attack free cryptosystem based on direct attack to phase-truncated Fourier-transform-based encryption using a random amplitude mask,” Opt. Lett. 38(18), 3684–3686 (2013). [CrossRef]  

27. M. Liao, W. He, D. Lu, and X. Peng, “Ciphertext-only attack on optical cryptosystem with spatially incoherent illumination: from the view of imaging through scattering medium,” Sci. Rep. 7(1), 41789 (2017). [CrossRef]  

28. J. Barrera, C. Vargas, M. Tebaldi, and R. Torroba, “Chosen-plaintext attack on a joint transform correlator encrypting system,” Opt. Commun. 283(20), 3917–3921 (2010). [CrossRef]  

29. J. F. Barrera, C. Vargas, M. Tebaldi, R. Torroba, and N. Bolognini, “Known-plaintext attack on a joint transform correlator encrypting system,” Opt. Lett. 35(21), 3553–3555 (2010). [CrossRef]  

30. C. Zhang, M. Liao, W. He, and X. Peng, “Ciphertext-only attack on a joint transform correlator encryption system,” Opt. Express 21(23), 28523–28530 (2013). [CrossRef]  

31. D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (2015).

32. L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Process. Mag. 29(6), 141–142 (2012). [CrossRef]  

33. H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747 (2017).
