End-to-End Image Steganography Using Deep Convolutional Autoencoders

Image steganography is used to hide a secret image inside a cover image in plain sight. Traditionally, the secret data is converted into binary bits and the cover image is manipulated statistically to embed the secret binary bits. Overloading the cover image may lead to distortions and the secret information may become visible. Hence the hiding capacity of the traditional methods are limited. In this paper, a light-weight yet simple deep convolutional autoencoder architecture is proposed to embed a secret image inside a cover image as well as to extract the embedded secret image from the stego image. The proposed method is evaluated using three datasets - COCO, CelebA and ImageNet. Peak Signal-to-Noise Ratio, hiding capacity and imperceptibility results on the test set are used to measure the performance. The proposed method has been evaluated using various images including Lena, airplane, baboon and peppers and compared against other traditional image steganography methods. The experimental results have demonstrated that the proposed method has higher hiding capacity, security and robustness, and imperceptibility performances than other deep learning image steganography methods.


I. INTRODUCTION
Technology is steering the social and economic developments.With the introduction of the Internet of Things (IoT), the necessity for the transfer of data between the sensors, cloud servers, and end devices has become inevitable.There is a complete changeover in the modes and forms of communication and digital media has become the primary mode of communications.Even in offices, people have shifted from using paper-based memos to electronic mails for communication.With the unlimited computational power and storage, transfer of images and videos have become possible.Images and videos are preferred as the predominant form of communication.On the other side, with all these technological developments, the need for protecting the privacy, security, and confidentiality of the data that is being transferred arises.Information hiding techniques are potentially a useful rescue to facilitate the secure transfer of data.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Zhao .
Information security methods can be classified into three categories: digital watermarking, steganography, and cryptography.Cryptography is the most widely used method and works by converting the plaintext into ciphertext.The ciphertext generated by a cryptographic protocol is visible to human eyes leading to suspicion.Image steganography is the art and science of hiding secret information inside an image file.It is a research field that is gaining popularity to secure data transfer or storage.On the other hand, image steganalysis can be used to break and uncover the embedded secret information from the image.Steganography methods can deal with a variety of digital media like images, videos, text and audio.The data structure of the secret message is not modified and the hidden message is not visible, hence reducing any suspicion and scrutiny.
Image steganography is correlated to the prisoners' problem, where Alice and Bob are in prison.They want to escape, but, any communication between them is monitored by Eve.Eve is suspicious of Alice and Bob and so scrutinizes all the communication between them.In this case, Alice and Bob should communicate in a way that is understandable to only them.For Eve it should look like normal message [1].
Image steganography has attracted significant interests from the research community resulting in considerable enhancements in line with the recent advances of digital media technologies.Initially, traditional methods such as Least Significant Bits (LSB) substitution, Pixel Value Differencing (PVD), Discrete Wavelet Transformation (DWT) were used for hiding the secret information inside a carrier image by exploiting the pixel value of the cover image [2].However, the hiding capacity of the traditional methods is limited due to embedding capacity issues which can lead to the exposure of the presence of the secret media.Deep learning is another paradigm which is extensively used in computer vision applications.Deep learning methods including autoencoders and generative adversarial networks (GANs) have been adapted and employed in image steganography.These methods have increased the hiding capacity, security and robustness compared to traditional methods.Reversible image steganography is a technique where steganography and steganalysis are performed together [3].Reversible image steganography using deep learning method has further increased hiding capacity, security and robustness.In this paper, an autoencoder-decoder architecture model is proposed for performing steganography and steganalysis.The main aim of this paper is to design a reversible end-to-end image steganography model.The contribution of this work is listed as follows: • A simple and light weight autoencoder-decoder model architecture is used for embedding and extracting the secret image inside the cover image.
• Instead of concatenating the raw images, features extracted from the cover image and secret image are concatenated.This reduces the amount of redundant data in the raw images and uses only the most important features of the images.
• A customized loss function is designed and exchanged as feedback to the training model to optimize the learning function.

II. RELATED WORKS
Steganography is not a new topic and has been in existence since 440 BC.It has historically evolved from shaving the heads of the sub-ordinates, engraving the secret message using invisible inks during the world war to digital steganography in recent times.Statistical methods including the LSB substitution, PVD methods are used to hide secret messages in cover images.In the LSB method, the secret information is converted into binary bits which are embedded in the least significant bits of the carrier image [4].It is assumed that the resolution of the cover image is high and exploiting the three least significant bits for every byte in the image will not reduce the precision or arouse any suspicion [5] and [6].
Another variant of LSB uses Huffman encoding on the secret information before embedding [7].PVD is another statistical method used to hide larger number of binary bits of the secret information in the edges of a cover image and smaller number in the smoother regions of a cover image [8].
Information hiding on quantum images with modification of the direction technique [9], [10], local binary patterns [11] are some of the other techniques used.Most of the traditional methods can handle only text as the secret message with reduced hiding capacity.Reverse engineering is possible to attack the steganography as the embedding happens by statistically exploiting the cover image.A combination of cryptography and steganography is also proposed to increase the overall anti-detection property [10].Integer wavelet transform (IWT) is applied on the cover image and the chaotic map is used to determine the pixels for embedding the binary bits of the secret message [12].An inverse integer wavelet transform is performed to produce the stego image.During the extraction, the chaotic map used for embedding is generated again and the secret message is extracted with the help of the generated chaotic map.Similarly, a 3D sine chaotic map is used by Valandar et al. [13].Framelet transformation is applied on the cover image to recognize the transformer coefficients to embed the secret message bits without causing any visual distortion.A novel method called the Pixel Density Histogram (PHD) is proposed for halftone images [14], [15].
Deep learning (DL) methods, which have been widely used recently, have emerged as a viable tool in image steganography applications showing attractive performance and efficiency.Typically, CNN-based architectures and GAN-based methods are used for performing end-to-end steganography and steganalysis.The Autoencoder model is a popular network architecture used in image steganography [3], [16]- [18], and [19].Three components are needed for an end-to-end image steganography and steganalysis [20], [21]: (i) the preparation of the input images -cover and secret images, (ii) the hiding network for embedding the secret image inside the cover image and (iii) the reveal network for extracting the secret image from the container steganography image.Pixelwise CNN [22] and stylenet [23] are other networks used for the implementation of image steganography.To improve the anti-detection property of the steganography algorithm, the secret image is first transformed using DCT and then encrypted using Elliptic Curve Cryptography(ECC) [24].A SegNet architecture is used as the backbone for hiding and extraction networks.The input to the hiding network is the encrypted secret image and the cover image and the output form the container image.The container is the input image for the extraction network and the output is the encrypted version of the secret image.Finally, the decryption algorithm is carried out on the revealed image to obtain the final secret image.A Pyramid Pooling layer is placed in between the down sampling and the up sampling block and an ablation study is conducted to prove that the addition of pyramid pooling layer increases the performance of the model [25].Generative Adversarial Networks (GANs) are also widely used in the field of image steganography [26] and various GAN architectures have been proposed; for example, DCGAN [27], [28], WGAN [29], [30], cycleGAN [31]- [33].Embedding simulators in the place of steganalyzers are also implemented [34], [35].A sender-receiver scheme [36], [37], coverless steganography [3], [38], [39], and cryptographyscheme based [40], [41] are some of the other variations on the implementation of image steganography with GAN model as the base.A detailed description of the recent advances in image steganography can be found in [2].
The major shortcoming in the traditional image steganography method is the low hiding capacity.Trying to hide more information by tweaking a greater number of bits in the cover may expose the hidden secret information.Since the hiding happens by statistically exploiting the pixel values, steganalysis can be easily performed by reverse engineering.This will affect the security of the method.The quality of the extracted secret information may also be subsided thus affecting the robustness.Yet another drawback is that the media of the secret communication is mostly text.Even if an image is used, only grayscale images are used.Hiding the pixel values of the three-channel secret image inside another three-channel cover image can get quite difficult in traditional methods.However, the hiding capacity, which is an issue with most of the traditional methods, can be solved using deep learning based methods.The capacity is increased at the cost of the storage space, memory, and computation time because of the complexity of the models.In this paper, a simple autoencoder architecture that can hide a secret image inside a cover image is proposed with increased hiding capacity.In addition to an increased hiding capacity, security and robustness of the proposed method are also higher than the traditional methods.

III. PROPOSED METHODOLOGY
The overall workflow of the proposed method is given in figure 1 and consists of three modules -preprocessing module, embedding network and the extraction network.The preprocessing module prepares the cover image and secret image for the embedding network to reconstruct the stego image.The purpose of the embedding network is to reconstruct the stego image which hides the secret image inside the cover one.The extraction network recovers the hidden secret image from the container stego image.The preprocessing module together with the embedding network is placed at the sending end to produce the stego image.The extraction network is deployed at the receiving end to extract the secret image from the stego image.More details on each of the modules are given in the below subsections.
Mathematically, the proposed solution can be expressed as follows.Let c be the cover image and s be the secret image, the preprocessing module produces features f(c) and f(s) for the cover and the secret image.The final output of the preprocessing module is the aggregate of the features extracted f (c) + f (s).The main aim of the embedding method is to produce a stego image c such that c ≈ c and extraction network is to extract the secret image s which is s ≈ s.

A. PREPROCESSING MODULE
Instead of processing the raw form of the cover and the secret images, features are extracted from them using the preprocessing module.High resolution images often contain redundant data and by extracting the most meaningful features, the burden on the embedding network is reduced.The input size should be of the format m × m × n, which represents the three dimensions -width, height and depth.The width and height should be of the same size hence they are represented by m.After a thorough analysis of the existing literature [3], [21], [24], [39], the input size of the cover image is fixed to be 256 × 256.The input secret image can be of any size, the preprocessing module resizes the secret image to 256 × 256 since the cover image and the secret image should be of same size.The resize function from the skimage library is used to resize the cover image and the secret image to a fixed size of 256 × 256.Instead of representing the input images as colour gradients, the preprocessing module converts them into useful features that can be used by the embedding network.The preprocessing module consists of one input layer and three convolutional layers with increasing number of filters.The choice of the number of filters, filter size and the stride are purely dependent on the application.The main purpose of the preprocessing module is to extract usable and meaningful features through convolutional layers with different filter sizes.Initially, lower-level local features such as edges are extracted by using smaller filter sizes.The filter size is increased to help the model learn more sophisticated features.The number of filters used are 8, 16 and 32.The cover image and the secret image are passed through the preprocessing module in parallel.Finally, a merge layer is designed which concatenates the features extracted from the cover image and the secret image.

B. EMBEDDING NETWORK
The preprocessing module and the embedding network together are designed based on an auto-encoder architecture concept.The embedding network along with the preprocessing module have a hourglass structure with an expanding phase and a contracting phase.The autoencoder network takes the input and extracts the features using the encoder part.The latent space in an autoencoder is the feature representation of the input.The decoder part of the autoencoder is used to reconstruct the output image from the latent space.Image steganography applications does not require any dimensionality changes, the latent space should be the combined feature representation of the cover image and the secret image.The embedding network takes the concatenated features from the preprocessing module as the input to produce a latent space and reconstruct the stego image (which is close in resemblance to the cover image) from the latent space.Every bit of the secret image is hidden across every available bit of the cover image.The embedding network is designed with two convolutional layers with an increasing number of filters.The latent space at the end of the encoder represents the finer features of both cover image and the secret image concatenated.The decoder part of the embedding network has five convolutional layers with a decreasing number of filters since there is no need for any dimensionality change(s).The number of filters in the encoder part are 64, 128 and the decoder part of the embedding network has 128, 64, 32, 16 and 8 filters.ReLU activation is added at the end of the convolutional layers to introduce linearity by giving the max value for positives and 0s for negatives.ReLU is used because it makes the training easier with better performance as it overcomes the vanishing gradient problem which is common in architectures with multiple layers.ReLU can be given as h(c) = max(0,c).A convolutional layer with 3 filters is placed at the end of the embedding network to convert the 256 × 256 × 8 feature vector into 256 × 256 × 3 stego image output.Figure 2 represents the architecture of the preprocessing module and the embedding network together.

C. EXTRACTION NETWORK
The extraction network aims to extract the secret image hidden inside the stego image.After conducting controlled experiments, an architecture identical to the embedding network seems to give the best results in extracting the secret image with minimum information loss.The extraction network has an expanding phase and a contracting phase.The number of filters, filter size, stride and other hyperparameters are fine-tuned based on the experimental results.The architecture which produced the best result is described here.The expanding encoder part of the extraction network has five convolutional layers with an increasing number of filters (8,16,32,64,128).The decoder part has five convolutional layers with an decreasing number of filters (128,64,32,16,8).Each layer is designed with an ReLU activation.The decoder of the extraction network is followed by a convolutional layer with 3 filters to construct the extracted secret image.The extraction network architecture is given in figure 3.

D. CUSTOMIZED LOSS FUNCTION
Unlike conventional image reconstruction, image steganography process requires two input images and two output images.Therefore regular loss function may not be suitable for this purpose.A customized loss function is introduced to increase the performance of the architecture.There are two losses to be calculated: the embedding loss and the extraction loss.The embedding loss is calculated between the input cover image and the output stego image produced by the embedding network.On the other hand, the extraction loss is calculated between the input secret image and the extracted secret image by the extraction network.The overall loss is the sum of the embedding and extraction loss.
Let i be the cover image and i the reconstructed cover image with the secret image generated by the embedding network.Also, let h be the secret image and h the extracted secret image by the extraction network.The loss function has to be customized in such a way that it will help the model to optimize the learning function.Loss is a feedback measure given back to the model while training in each epoch as a measure of how well the model is performing through back-propagation.
The loss of the embedding network, L emb , is given by equation 1 and the loss of the extraction network, L ext , is given by equation 2. Finally, the overall loss, L, is calculated using equation 3.
where α is the error adjustment and is fixed to 0.3.Initial experiments were conducted by varying the values of α from 0.3, 0.6 and 0.9.Increasing the value of α increased the loss and 0.3 value produced optimal loss value.The embedding network's loss function is given back to the embedding network and the overall loss is given to the extraction network to minimize the distortions of the extracted secret image.

IV.
The experiments were conducted on ASUS laptop with Intel CORE i7, NVIDIA GEFORCE graphical card.Python 3 with Keras library using Tensorflow backend is the programming language employed throughout the experiments.Adam optimizer is used and the model was trained for 5 epochs.
Three important factors have to be evaluated for analysing the performance of the steganography model: hiding capacity, security and robustness and the imperceptibility.The hiding capacity is defined as the amount of secret information that can be hidden without distorting the cover image.In other words, this is the capacity per pixel of the steganography model.Since the cover image and the secret image have the same size (256 × 256), the capacity of the proposed method is 1.The hiding capacity can be calculated using equation 4.
where L is the length of the secret information.In this case, it is the product value of the height, width and channel of the secret image.H,C and W represent the height, number of channels and width of the cover image.Security is the ability to hide the data which can only be accessed by authorized users while the robustness is the ability of the model to embed and retrieve the secret media without distortions.Peak Signal-to-Noise Ratio (PSNR) is a popular metric used to measure the similarities between two images.First, Mean Squared Error (MSE) is calculated and the PSNR value is calculated from MSE. Equations to calculate the MSE and PSNR are given in equation 5 and 6 respectively.
Imperceptibility is a measure used to verify the visibility of the secret message hidden.Image results produced by the proposed model against the test set of each dataset is given to evaluate the imperceptibility.
Three datasets are used for training and testing the performances of the proposed architecture.Table 1 provides the necessary details on the datasets used.45000 image pairs from the datasets are taken for training and 5000 image pairs are used for testing.From the training images pair, 80 % is chosen for training and 20% is chosen for validation in random.

V. RESULTS AND DISCUSSION
The capacity of the proposed model is calculated and is then compared with a few other similar image steganography methods.It can be seen from the results that the hiding capacity of the proposed architecture is 1, which means, every bits of the secret image is hidden inside every bit of the cover image.
Some qualitative images on the test sets of the three datasets -COCO, ImageNet, CelebA are shown in figure 4. The images clearly show a higher imperceptibility of the proposed method.Even with increased hiding capacity, the secret image is hidden without any distortions to the cover image.The secret image is extracted back as well without much information loss.
The computational complexity is another important factor to be evaluated.Sometimes, based on the application scenario, a compromise between efficiency and computational complexity is made.A critical comparison of the time taken for training the embedding network on all the three datasets along with the computation time taken for the trained model to load and generate a single stego image is made in the bar chart shown in figure 5.A similar comparison for the extraction network is made in figure 5.The time taken by the embedding network is approximately twice the time taken by the extraction network.The reason is that the embedding network process two images (cover and secret), whereas, the extraction network processes only one image (stego image).
MSE and the PSNR values of the proposed method are reported and compared against state-of-the-art methods in table 4 and 2 respectively.The proposed method has comparatively better PSNR values compared with other deep learning methods [18], [21], [42].Another important aspect to be noted is that the proposed method is the lightest architecture in comparison.As shown in the table, the methods in comparison [18], [21], [42] use grayscale images as the secret media.Thus the hiding capacity is less.The proposed method uses RGB image as the secret media and so the hiding capacity is increased three-folds.In order to compare the proposed method against traditional methods, some popular images from SIPI dataset such as lena, airplane, fishing boat, truck, house, baboon and peppers are used.The same images are passed through   the proposed model and PSNR value of the stego image generated for each image is calculated.The average value of the PSNR is calculated and used for comparison against other traditional methods.The capacity of the traditional methods and the proposed method is calculated using equation 4. For traditional methods, the value of L in equation 4 is the amount of secret bits in case of text.When the secret media is an image, the value of L is the product of length, width and height of the image.The average PSNR results obtained from the proposed methods on these common images are given and compared with other traditional methods in table 3. The proposed method has only marginally high PSNR values with the methods in comparison.However, the hiding capacity is the highest for the proposed method and the secret media used is the RGB image.
From the results shown, it can be clearly seen that the proposed architecture has higher security, robustness, imperceptibility and information hiding capacity.

VI. CONCLUSION
In this paper, a light-weight but simple architecture is proposed to achieve end-to-end image steganography.The proposed architecture is inspired from the deep convolutional variation of the autoencoder.The whole system comprises of preprocessing module, embedding network and the extraction network.The preprocessing module prepares the input images for the embedding network to hide the secret image inside the cover image.The extraction network extracts the secret image from the stego image produced by the embedding network.The PSNR value of the proposed method is higher showing the higher security and robustness of the proposed method compared to other traditional and deep learning image steganography methods.The capacity of the proposed method is 1 and is the highest when compared with the traditional methods.The proposed method has an upper hand in terms of invisibility as well and can produce stego images very similar to the input cover image.

FIGURE 1 .
FIGURE 1. Overall workflow of the proposed method.The sending end consists of the preprocessing module and the embedding network.The preprocessing module extracts features from the cover image and the secret image and merges them using the concatenation layer.The embedding network reconstructs the stego image from the fused features.Finally, at the receiving end is the extraction network to extract the secret image from the stego image.

FIGURE 2 .
FIGURE 2. The architecture of the preprocessing module and the embedding network of the proposed method.

FIGURE 3 .
FIGURE 3. The architecture of the extraction network of the proposed method.

FIGURE 4 .
FIGURE 4. Qualitative image results of the proposed model on the CelebA, COCO and ImageNet datasets.(a) represents the cover image, (b) represents the secret image, (c) represents the stego image of (b) embedded in (a), and (d) represents the extracted secret image from (c).

FIGURE 5 .
FIGURE 5. Critical time analysis on (a) Embedding network, and (b) Extraction network.

TABLE 4 .
MSE and PSNR values of the proposed method.

TABLE 1 .
Details on the datasets.

TABLE 2 .
Result comparison between the proposed method and other deep learning methods.

TABLE 3 .
Result comparison between the proposed method and other traditional methods.