Bayesian Neural Networks for Reversible Steganography

Recent advances in deep learning have led to a paradigm shift in the field of reversible steganography. A fundamental pillar of reversible steganography is predictive modelling which can be realised via deep neural networks. However, non-trivial errors exist in inferences about some out-of-distribution and noisy data. In view of this issue, we propose to consider uncertainty in predictive models based upon a theoretical framework of Bayesian deep learning, thereby creating an adaptive steganographic system. Most modern deep-learning models are regarded as deterministic because they only offer predictions while failing to provide uncertainty measurement. Bayesian neural networks bring a probabilistic perspective to deep learning and can be regarded as self-aware intelligent machinery; that is, a machine that knows its own limitations. To quantify uncertainty, we apply Bayesian statistics to model the predictive distribution and approximate it through Monte Carlo sampling with stochastic forward passes. We further show that predictive uncertainty can be disentangled into aleatoric and epistemic uncertainties and these quantities can be learnt unsupervised. Experimental results demonstrate an improvement delivered by Bayesian uncertainty analysis upon steganographic rate-distortion performance.


I. INTRODUCTION
Artificial intelligence arises from the question 'Can machines think?' [1]. Machine learning refocuses attention on addressing solvable problems of a practical nature automatically through the use of data. Deep learning is a subclass of machine-learning algorithms based on neural networks and connectionism [2]-[4]. The advent of data-centric artificial intelligence is accompanied by cybersecurity concerns. It has been reported that machine-learning models are vulnerable to adversarial attacks [5]. One such example is the introduction of invisible perturbations to data to cause erroneous decision-making [6]. Another example is the injection of poisonous data that subsequently contaminates and disrupts the learning process [7]. There is also an insidious threat that malware code could be hidden within a neural network model using obfuscation techniques and subsequently triggered by specific input data [8] (metaphorically a Trojan Horse stratagem). Data authentication serves a crucial role in cybersecurity, being at the forefront of defensive strategies against various cybercrimes. Modern cryptography offers a variety of approaches to authentication (e.g. digital signatures [9] and trusted timestamps [10]). However, the maintenance of such auxiliary information imposes an additional burden upon data management requirements.
Steganography is the art and science of concealing messages within digital media [11]. It can serve as a potential solution for managing auxiliary information. By embedding an auxiliary message into its corresponding data sample, the link between the sample and the message is naturally preserved throughout the entire information lifecycle. While steganographic distortion is generally imperceptible, reversibility is required for applications in which data integrity is a major priority (e.g. forensic science, legal proceedings, medical diagnosis, and military reconnaissance) [12]-[17]. When a steganographic process is irreversible, distortion may accumulate over time and eventually render data worthless. The recent development of deep learning has brought a paradigm shift in reversible steganography [18]-[20]. Similar to lossless compression [21], predictive modelling forms a fundamental pillar of reversible steganography for analysing redundancy in digital signals. It has been reported that deep neural networks can be used as advanced predictive models [22]-[24]. Despite the improved accuracy offered by neural networks, non-trivial prediction errors still occur when making inferences about some out-of-distribution and noisy test data. This leads us to the study of uncertainty in deep learning.
Most modern deep-learning models are regarded as deterministic as they only offer predictions while lacking confidence bounds for data analysis and decision-making, thereby incurring risks in automated systems. As a safety concern, it is important to be aware of the limitations of a machine that is deployed in real-world settings and granted autonomous control [25]. While the notion of machine consciousness is illusive and there is no indication that contemporary artificial intelligence is anywhere close to engendering that, uncertainty quantification would be a principal requirement for the development of self-aware intelligent machinery. Bayesian deep learning provides a way of calculating uncertainty based on a probabilistic conception [26].
In this paper, we study predictive uncertainty in deep neural networks for reversible steganography based on a theoretical framework of Bayesian deep learning [27]. Our objective is to develop a learning-based method for quantifying uncertainty caused by out-of-distribution and noisy data, thereby enabling an adaptive steganographic system and improving steganographic rate-distortion performance.

II. METHODOLOGY
We begin by formulating a reversible steganographic scheme that incorporates a Bayesian neural network and then present a derivation of uncertainty.

A. REVERSIBLE STEGANOGRAPHY
Reversible steganography considers the following scenario. A sender communicates a message to a receiver by introducing removable modifications to a carrier signal. We refer to the original signal as the cover and its modified counterpart as the stego. Residual modulation is a conventional technique used to hide messages within digital images in a reversible fashion. There are many variations of residual modulation [28]-[32]. An optimal coding for residual modulation is the subject of ongoing research and the choice has few implications for the findings of this study. The following workflow describes a scheme based on residual modulation that incorporates a Bayesian neural network (BNN), as illustrated in Figure 1. In the preliminary phase, pixels of an image are divided into a context set and a query set, denoted respectively by $x$ and $y$. A common method of context/query splitting is to form a chequerboard pattern such that each query pixel has 4 adjacent context pixels connected horizontally and vertically. A BNN is then deployed to predict the intensity of each query pixel as well as to estimate the inherent variance in data based on the given contextual information:
$$\{\hat{y}, \sigma^2\} = \mathrm{BNN}(x).$$
For the time being, we assume that an uncertainty map is derived from $\hat{y}$, $\sigma^2$, or both. The map indicates an estimated uncertainty over the prediction for each pixel and provides guidance on message embedding. Uncertainty may be caused by certain rare and stochastic patterns in images. Residual modulation is premised on a statistical principle (i.e. law of error) that residuals generally centre around zero and the frequency of a residual is inversely proportional to its magnitude [33]. In general, it assigns residuals of small magnitude as the stego channel to carry the payload at the expense of causing greater distortion to large residuals.
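The chequerboard splitting described above can be sketched as follows. This is a minimal illustration, assuming NumPy; the function name `chequerboard_masks` and the choice of which parity forms the query set are assumptions made here, not details given in the text.

```python
import numpy as np

def chequerboard_masks(height, width):
    """Return boolean (context_mask, query_mask) forming a chequerboard.

    Assumption: pixels whose row + column index is odd are query pixels.
    Every interior query pixel then has 4 context neighbours
    (up, down, left, right).
    """
    rows, cols = np.indices((height, width))
    query_mask = (rows + cols) % 2 == 1
    return ~query_mask, query_mask

# Toy 4x4 "image" with intensities 0..15.
image = np.arange(16, dtype=np.uint8).reshape(4, 4)
context_mask, query_mask = chequerboard_masks(*image.shape)

x = np.where(context_mask, image, 0)  # context set (query positions zeroed)
y = image[query_mask]                 # query set, in raster order
```

Because message embedding modifies query pixels only, both the sender and the receiver can recompute `x` exactly, which is what allows the preliminary phase to be repeated at the decoder.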
Based on the supposition that the expected residual magnitude can be captured by the uncertainty map, we can modulate the residuals in order of ascending uncertainty. In practice, we sort them by ascending uncertainty prior to modulation, leading to an adaptive embedding pathway (in contrast to a sequential or random pathway). In the encoding phase, residuals are computed by $\epsilon = y - \hat{y}$. Message bits $m$ are embedded by modulating the residuals, causing recoverable distortion to the residuals:
$$\epsilon' = \mathrm{modulate}(\epsilon, m).$$
The modulated residuals are added to the predicted intensities, resulting in a stego query set $y' = \hat{y} + \epsilon'$. Message extraction and image restoration are performed in a similar manner. To begin with, the procedures in the preliminary phase are carried out, yielding the same results since the contextual information remains intact. In the decoding phase, the embedding order is identified in accordance with the uncertainty map and the residuals are computed by $\epsilon' = y' - \hat{y}$. The message is extracted and the residuals revert to their pristine state based on the corresponding de-modulation algorithm:
$$\{\epsilon, m\} = \mathrm{demodulate}(\epsilon').$$
Finally, the original image is recovered by $y = \hat{y} + \epsilon$.
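To make the encode/decode cycle concrete, the sketch below uses a simple histogram-shifting rule as the modulation. This rule is swapped in purely for illustration and is not necessarily the coding used in this paper. The `order` and `n_bits` arguments are assumptions: the message length is taken to be known to the receiver, and pixel overflow is ignored.

```python
import numpy as np

def modulate(residuals, bits, order):
    """Embed bits into integer residuals, visiting them in `order`.

    Illustrative rule: a residual of 0 carries one bit (0 -> 0 or 1),
    positive residuals are shifted by +1, negative ones are unchanged.
    """
    out = residuals.copy()
    k = 0
    for i in order:
        if k == len(bits):
            break  # payload exhausted; remaining residuals stay untouched
        if out[i] == 0:
            out[i] = bits[k]
            k += 1
        elif out[i] > 0:
            out[i] += 1
    if k < len(bits):
        raise ValueError("payload exceeds embedding capacity")
    return out

def demodulate(stego_residuals, n_bits, order):
    """Invert `modulate`: recover the original residuals and the bits."""
    out = stego_residuals.copy()
    bits = []
    for i in order:
        if len(bits) == n_bits:
            break
        if out[i] in (0, 1):
            bits.append(int(out[i]))
            out[i] = 0
        elif out[i] > 1:
            out[i] -= 1
    return out, bits

# Toy residuals and per-pixel uncertainty (hypothetical values).
residuals = np.array([0, 2, -1, 0, 1, 0])
uncertainty = np.array([0.5, 0.1, 0.3, 0.2, 0.9, 0.4])
order = np.argsort(uncertainty, kind="stable")  # ascending uncertainty

stego = modulate(residuals, [1, 0], order)
restored, extracted = demodulate(stego, 2, order)
```

Since the uncertainty map depends only on the context pixels, the receiver derives the same `order` as the sender, and the traversal stops at the same position on both sides.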

B. BAYESIAN INFERENCE
A neural network is a non-linear function that maps an input and model parameters to an output:
$$y = f(x; \theta) + \epsilon,$$
where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ describes Gaussian observation noise.
The goal of this regression problem is to find latent model parameters $\theta$ that accurately fit the observed data. Let us denote by $X$ the training inputs and by $Y$ the corresponding outputs. Maximum likelihood estimation (MLE) finds the most likely setting of parameters for the set of data, as given by
$$\theta_{\mathrm{MLE}} = \arg\max_{\theta} \, p(Y \mid X, \theta).$$
If we have prior knowledge about the distribution of the parameters, we can invoke Bayes' theorem and derive a posterior distribution of parameters:
$$p(\theta \mid X, Y) = \frac{p(Y \mid X, \theta) \, p(\theta)}{p(Y \mid X)}.$$
The denominator of the posterior is the marginal likelihood or model evidence, as defined by
$$p(Y \mid X) = \int p(Y \mid X, \theta) \, p(\theta) \, d\theta.$$
Since the denominator does not depend on $\theta$, maximum a posteriori (MAP) estimation ignores it and obtains
$$\theta_{\mathrm{MAP}} = \arg\max_{\theta} \, p(Y \mid X, \theta) \, p(\theta).$$
From a Bayesian interpretation, finding neural network parameters that minimise a loss function is conceptually similar to MLE, whereas training with weight-decay regularisation has a similar underlying principle to MAP [34]-[36]. Bayesian inference takes the full posterior distribution into account to support robust decision-making. Rather than relying solely on a single hypothesis (i.e. a specific setting of parameters), Bayesian inference leverages all possible settings of parameters, weighted by their plausibilities (i.e. posterior probabilities) [37]. It propagates uncertainty from the parameters to the data by deriving the (posterior) predictive distribution of $y^*$ at a test input $x^*$ using the parameter posterior:
$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \theta) \, p(\theta \mid X, Y) \, d\theta.$$
The parameter posterior involves the computation of model evidence, which requires solving an integration referred to as marginalisation. However, marginalisation is analytically intractable for deep-learning models and therefore we have to resort to approximation techniques.
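The correspondence between MAP estimation and weight decay can be made explicit with a short worked example. It assumes that the network is written as $f(x; \theta)$, that the likelihood is Gaussian with fixed variance $\sigma^2$, and that the prior is a zero-mean isotropic Gaussian with variance $\tau^2$. The symbol $\tau$ and the index $i$ over training pairs $(x_i, y_i)$ are introduced here for illustration only.

```latex
\begin{align}
-\log p(\theta \mid X, Y)
  &= -\log p(Y \mid X, \theta) - \log p(\theta) + \text{const} \\
  &= \frac{1}{2\sigma^2} \sum_{i} \lVert y_i - f(x_i; \theta) \rVert^2
     + \frac{1}{2\tau^2} \lVert \theta \rVert^2 + \text{const}.
\end{align}
```

Under these assumptions, minimising the first term alone recovers least-squares training (the MLE view), while the second term is an L2 penalty on the parameters whose strength, relative to the data term, is $\sigma^2 / \tau^2$ (the MAP view).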

C. MONTE CARLO DROPOUT
Variational inference is a technique for approximating intractable integrals in Bayesian inference [38]-[42]. Instead of evaluating the parameter posterior, we approximate it with a variational distribution $q(\theta)$, which belongs to a family of distributions of simpler form. By replacing $p(\theta \mid X, Y)$ with $q(\theta)$ in the predictive distribution and approximating the integral with Monte Carlo integration, we derive that
$$p(y^* \mid x^*, X, Y) \approx \int p(y^* \mid x^*, \theta) \, q(\theta) \, d\theta \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \mid x^*, \hat{\theta}_t),$$
where $\hat{\theta}_t \sim q(\theta)$. Sampling model parameters from a variational distribution can be simulated by dropout [43], a stochastic process of multiplying the output of each neurone by a random variable drawn from a Bernoulli distribution [44]. In other words, each dropout configuration deactivates a portion of neurones, yielding a plausible realisation of the parametric model. Performing stochastic forward passes $T$ times through a model with different dropout masks results in an ensemble of neural networks, each with a slightly different sparse graph structure. The usage of dropout at the inference stage is referred to as Monte Carlo dropout [45], as illustrated in Figure 2.
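The stochastic forward passes can be sketched with a toy two-layer network in NumPy. The weights `W1` and `W2` are random placeholders standing in for a trained model, the function names are hypothetical, and the rescaling by the keep probability (inverted dropout) is a common convention assumed here rather than something stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy weights: 4 context neighbours -> 16 hidden units -> 1 output.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))

def stochastic_forward(x, drop_rate, rng):
    """One forward pass with a freshly sampled Bernoulli dropout mask."""
    h = np.maximum(x @ W1, 0.0)                            # ReLU activation
    keep = rng.binomial(1, 1.0 - drop_rate, size=h.shape)  # 1 = keep, 0 = drop
    h = h * keep / (1.0 - drop_rate)                       # inverted-dropout scaling
    return h @ W2

def mc_dropout_predict(x, n_samples, drop_rate=0.3, rng=rng):
    """Average T stochastic passes; also return the individual samples."""
    samples = np.stack(
        [stochastic_forward(x, drop_rate, rng) for _ in range(n_samples)]
    )
    return samples.mean(axis=0), samples

x_batch = rng.uniform(size=(5, 4))  # 5 query pixels, 4 context values each
mean_prediction, samples = mc_dropout_predict(x_batch, n_samples=50)
```

Each row of `samples` corresponds to one dropout configuration, so the spread across the first axis reflects disagreement within the implicit ensemble.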

D. UNCERTAINTY DISENTANGLEMENT
Bayesian inference offers a predictive distribution from which we can derive predictive uncertainty. This is the sum of uncertainty due to observation noise and uncertainty in the model parameters. According to the law of total variance, a variance can be decomposed into 'unexplained' and 'explained' components. This permits the disentanglement of aleatoric and epistemic uncertainties from predictive uncertainty [46]:
$$\mathrm{Var}(y^*) = \mathbb{E}_{q(\theta)}\left[\mathrm{Var}(y^* \mid x^*, \theta)\right] + \mathrm{Var}_{q(\theta)}\left(\mathbb{E}[y^* \mid x^*, \theta]\right).$$
To estimate the two components, we perform dropout at test time and sample $T$ pairs of predictions and variance estimates, $\{\hat{y}_t, \hat{\sigma}_t^2\}_{t=1}^{T}$. Aleatoric uncertainty captures randomness (or noise) inherent in observations:
$$\mathbb{E}_{q(\theta)}\left[\mathrm{Var}(y^* \mid x^*, \theta)\right] \approx \frac{1}{T} \sum_{t=1}^{T} \hat{\sigma}_t^2.$$
Epistemic uncertainty occurs due to limited knowledge (or training data):
$$\mathrm{Var}_{q(\theta)}\left(\mathbb{E}[y^* \mid x^*, \theta]\right) \approx \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t^2 - \left(\frac{1}{T} \sum_{t=1}^{T} \hat{y}_t\right)^2.$$
Hybrid predictive uncertainty can be computed by adding together aleatoric and epistemic uncertainties, each normalised with its sum.
The loss function for the dual-headed BNN is composed of a distance function $D$ and a regulariser $R$ balanced by $\lambda$ [47], as given by
$$\mathcal{L} = D + \lambda R.$$
Let us denote by $N$ the number of pixels in an image. The distance function is computed over the $N$ pixels as a weighted distance, in which the first term is the Euclidean distance between the target and the prediction and the second term represents the inverse of the normalised variance. This weighted distance term discourages the model from causing high regression residuals with low uncertainty and attenuates loss in conditions of high uncertainty. The regulariser is a function of the estimated variance, designed to penalise a high degree of uncertainty, thereby preventing the model from inactive learning.
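The decomposition can be sketched numerically as follows, assuming that each of the $T$ stochastic passes returns a predicted mean and a predicted variance per pixel. The function names are hypothetical, and the hybrid measure reflects one reading of "each normalised with its sum", namely that each map is divided by its own total over all pixels before the two are added.

```python
import numpy as np

def disentangle(means, variances):
    """Split predictive uncertainty via the law of total variance.

    means, variances: arrays of shape (T, N), one row per stochastic pass.
    """
    aleatoric = variances.mean(axis=0)                               # E[Var]
    epistemic = (means ** 2).mean(axis=0) - means.mean(axis=0) ** 2  # Var[E]
    return aleatoric, epistemic, aleatoric + epistemic

def hybrid(aleatoric, epistemic, eps=1e-12):
    """Sum of the two maps, each divided by its own total (assumed reading).

    eps guards against division by zero when a map is identically zero.
    """
    return aleatoric / (aleatoric.sum() + eps) + epistemic / (epistemic.sum() + eps)

# Toy example: T = 2 passes over N = 2 pixels (hypothetical values).
means = np.array([[1.0, 2.0], [3.0, 2.0]])
variances = np.array([[0.5, 1.0], [1.5, 1.0]])

aleatoric, epistemic, predictive = disentangle(means, variances)
uncertainty_map = hybrid(aleatoric, epistemic)
```

In this toy case the two passes disagree on the first pixel only, so all epistemic uncertainty is assigned to it, while the aleatoric part is equal for both pixels.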

III. EXPERIMENTS
The primary purpose of our experiments is to identify the contribution made by Bayesian uncertainty analysis to steganographic performance. We employ an advanced predictive neural network as a baseline model for benchmarking and build an uncertainty-aware model therefrom.

A. EXPERIMENTAL SETUP
Our implementation of the BNN is based primarily on the residual dense network (RDN), which is a state-of-the-art model for low-level computer vision tasks (e.g. image reconstruction, super-resolution, and denoising) [48]. We use this model as a baseline predictive model and build an uncertainty-aware predictive model upon it. The RDN model is characterised by a tangled labyrinth of residual and dense connections. We train the RDN model on the BOSSbase dataset [49], which consists of 10,000 greyscale photographs collected for an academic competition in the field of digital steganography. The inference set comprises standard test images from the USC-SIPI dataset [50]. For the uncertainty-aware model, we apply a dropout layer after each non-linear activation function and stack two output branches to form a dual-headed model, as illustrated in Figure 3. The dropout rate is set to 0.3, the loss balancing parameter $\lambda$ to 1, and the number of dropout samples $T$ to 1,000; these values were chosen empirically.

Figures 4 and 5 visualise the experimental results.
The metrics used for measuring the quality of predicted images are peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). In general, predictive performance is high for smooth images and low for richly textured images. Chequerboard artefacts are observed in the uncertainty maps since a chequered pattern (derived from 4-connectivity) is used for context-query splitting. The intensities of context pixels in an input image are consistent with those in the corresponding target image. The results suggest that the model is capable of eliminating uncertainty regarding the context pixels. It can be seen that uncertainty is concentrated around edges, contours, rare patterns and textural details. Figure 6 depicts uncertainty quantification performance, represented by the root-mean-square error (RMSE) vis-à-vis the percentage of pixels. It illustrates the deviation of predictions as the percentage of pixels increases, where the pixels are selected in ascending order of uncertainty magnitude. The upper bound on performance is obtained by selecting pixels in ascending order of residual magnitude, whereas the lower bound is computed by selecting pixels in random order. The results verify the validity of uncertainty analysis by comparing the uncertainty-aware selection with the random selection. A more accurate uncertainty measurement is expected to produce a curve closer to the upper bound. Figure 7 evaluates steganographic performance with rate-distortion curves. Capacity is measured by the embedding rate expressed in bits per pixel (bpp) and distortion is measured by the PSNR expressed in decibels (dB). The results show that an adaptive embedding pathway derived by Bayesian uncertainty analysis leads to a better rate-distortion performance than a random embedding pathway.
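The RMSE-versus-percentage evaluation can be sketched as below. The data are synthetic: per-pixel noise levels are drawn at random and the squared noise level is used as a stand-in for an uncertainty estimate, so the numbers carry no relation to the reported results. The function name `rmse_curve` is hypothetical.

```python
import numpy as np

def rmse_curve(residuals, ranking, fractions):
    """RMSE over the first fraction of pixels, taken in ascending `ranking`."""
    order = np.argsort(ranking, kind="stable")
    squared = residuals[order].astype(float) ** 2
    running_mse = np.cumsum(squared) / np.arange(1, squared.size + 1)
    counts = np.maximum(1, np.round(fractions * squared.size).astype(int))
    return np.sqrt(running_mse[counts - 1])

rng = np.random.default_rng(0)
sigma = rng.uniform(0.5, 5.0, size=10_000)  # synthetic per-pixel noise level
residuals = rng.normal(0.0, sigma)          # synthetic prediction residuals
fractions = np.linspace(0.1, 1.0, 10)

# Selection by (synthetic) uncertainty, by true residual magnitude, and at random.
by_uncertainty = rmse_curve(residuals, sigma ** 2, fractions)
oracle = rmse_curve(residuals, np.abs(residuals), fractions)
random_order = rmse_curve(residuals, rng.permutation(residuals.size), fractions)
```

Selecting by residual magnitude gives the lowest attainable RMSE at every percentage, which is why it serves as the performance bound, and all curves coincide once 100% of the pixels are included.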

IV. CONCLUSION
In this paper, we study reversible steganography with deep learning and analyse uncertainty in predictive models based upon a Bayesian framework. We apply the Monte Carlo dropout to approximate the predictive distribution and derive aleatoric and epistemic uncertainties therefrom. A dual-headed neural network is constructed for estimating uncertainty in an unsupervised manner. Experimental results demonstrate state-of-the-art steganographic performance benchmarked against a non-Bayesian baseline, confirming the contribution of uncertainty analysis. We hope that this article can contribute to future research on reversible steganography and envisage further progress being ushered in with new developments of Bayesian deep learning.