Deep Learning for Predictive Analytics in Reversible Steganography

Deep learning is regarded as a promising solution for reversible steganography. There is an accelerating trend of representing a reversible stego-system by monolithic neural networks, which bypass intermediate operations in traditional pipelines of reversible steganography. This end-to-end paradigm, however, suffers from imperfect reversibility. By contrast, the modular paradigm that incorporates neural networks into modules of traditional pipelines can stably guarantee reversibility with mathematical explainability. Prediction-error modulation is a well-established reversible steganography pipeline for digital images. It consists of a predictive analytics module and a reversible coding module. Given that reversibility is governed independently by the coding module, we narrow our focus to the incorporation of neural networks into the analytics module, which serves the purpose of predicting pixel intensities and plays a pivotal role in determining capacity and imperceptibility. The objective of this study is to evaluate the impacts of different training configurations upon predictive accuracy of neural networks and provide practical insights. In particular, we investigate how different initialisation strategies for input images may affect the learning process and how different training strategies for dual-layer prediction respond to the problem of distributional shift. Furthermore, we compare steganographic performance of various model architectures with different loss functions.

thereby compromising the accuracy of the system. To combat such cybersecurity threats, steganography-based authentication schemes can be applied to verify the integrity of training and validation data, thereby ensuring secure data exchange. If an irreversible steganographic algorithm is incorporated into a data exchange protocol for authentication purposes, steganographic distortion might present uncontrollable risks to the reliability of data-centric autonomous machines [12]. Therefore, the ability to remove steganographic distortion and restore data integrity is of paramount importance.
The core of reversible steganography, in common with lossless compression, is predictive coding [13]. Prediction-error modulation is one of the main pillars of contemporary reversible steganography due to its optimum rate-distortion performance [14]-[16]. A prediction-error modulation scheme has an analytics module to serve the purpose of context-aware pixel intensity prediction, and a coding module to perform encoding/decoding in the residual domain. Deep learning is currently at the very heart of multimedia analytics and it offers appealing solutions for pixel intensity prediction. In this paper, we investigate the cause-effect relationships between the variables of interest regarding the use of deep learning models for pixel intensity prediction. In particular, we investigate the impacts of different initialisation and training strategies upon predictive accuracy. Our contributions are summarised as follows:
• We compare zero initialisation and local-mean initialisation to gain an insight into whether pre-processing of input images with a local-mean filter can help the models perform better or has a counterproductive effect.
• We explore universal training, independent training, and causal training to gain an insight into the problem of distributional shift in dual-layer prediction caused by steganographic distortion.
• We carry out an ablation study to analyse the impacts of different loss functions upon steganographic performance with the aim of providing insights regarding the applicability of neural network models originally proposed for low-level computer vision, and demonstrate the state-of-the-art steganographic performance with the residual dense network (RDN), which is an advanced model for image super-resolution [17] and image restoration [18].
The remainder of this paper is organised as follows. Section II provides a literature review on the recent development of steganography with deep learning. Section III delineates modules of a reversible steganographic scheme. Section IV introduces the methodology concerning intensity initialisation strategies, dual-layer training strategies, and neural network models. Section V evaluates the impact of different variables of interest upon predictive accuracy and steganographic performance. Limitations of this study are discussed in Section VI. Conclusions are drawn in Section VII.

II. LITERATURE REVIEW
Artificial intelligence has been evolving over the years, being at the forefront of transformations of the world we live in [19]. Deep learning, as a new class of multi-purpose intelligent algorithms, learns how to solve complicated tasks through the observation of a large amount of data, and has brought about groundbreaking advances in many branches of science [20]. The foundations of deep learning are neural networks, or connectionist systems, which are capable of discovering intricate structures in high-dimensional data via multiple layers of artificial neurons. The recent development of steganographic methods is centred on the use of deep-learning models to enhance performance. Significant breakthroughs have been achieved in some basic properties such as capacity (the allowable size of the embedded message) and imperceptibility (the perceptual quality of the stego object), as well as some application-oriented properties such as secrecy (the degree to which the stego object can elude detection), and robustness (the degree to which the embedded message can survive distortions of various forms) [21]. In this section, we briefly review some seminal studies of secure and robust steganography with deep learning and then discuss the use of neural networks in reversible steganography with a contrast drawn between end-to-end and modular frameworks. Note that our study focuses on the basic properties (i.e. capacity and imperceptibility) and reversibility since application-oriented requirements can be contradictory and difficult to achieve concurrently.

A. SECRECY
Secrecy is at the heart of covert communications. The ability to pass secret information under surveillance is essential to espionage and military operations. It has been reported that deep learning can be used to identify locations in a cover image at which message embedding would not arouse suspicion [22]-[24]. More technically, a deep learning algorithm assigns a cost to every pixel to quantify the effect of making modifications. In this way, a high degree of perceptual and statistical undetectability can be reached by minimising the cost. Other approaches include using generative models to convert a given message into a cover image which is less vulnerable to steganalysis [25], to transform a cover image along with a secret message into a stego image [26], and to encode a message directly into a realistic stego image in the absence of an explicit cover image [27].

B. ROBUSTNESS
Robustness is a prioritised requirement for copyright protection. In commercial applications, entrepreneurs can prevent copyright infringement by embedding a registered watermark into digital assets. Deep learning has been used to embed invisible watermarks into images and videos in a durable way in order to identify copyright ownership and deter illegitimate copying [28]. Unauthorised screen recording is a new scenario in which an adversary attempts to capture an electronically displayed still photograph or video footage with a digital camera, causing optical interference in watermark extraction. It has been shown that neural networks can be trained to simulate optical interference and then learn to encode and decode watermarks in a robust manner [29]. Decoding accuracy can also be enhanced by using neural networks to remove light artefacts such as moiré fringes, an interference pattern with a rippled or wavy appearance, prior to watermark extraction [30]. To facilitate augmented reality, deep learning has been used to decode hyperlinks embedded in physical photographs rather than digital media, subject to real-world variations in print quality, illumination, occlusion and viewing distance [31].

C. REVERSIBILITY-END-TO-END FRAMEWORK
Reversibility is a desirable characteristic for applications in which accuracy and consistency of data are important. It is a requisite for preventing an accumulation of steganographic distortion over each data transmission. A reversible steganographic method often relies on intricate logical operations to regulate imperceptibility and guarantee reversibility. Such operations are hard to achieve using existing neural network models. One of the earliest approaches uses a U-shaped network (U-Net) to automatically encode a cover image and a secret message into a stego image and a separate neural network model for decoding [32]. Another approach encodes a message bitstream into a cover image via a generative model, trains a cycle-consistent generative adversarial network (CycleGAN) to learn the back-and-forth transformation between the cover and stego images, and extracts the message bitstream from the stego image by using another neural network model [33]. Furthermore, an intriguing study shows that both encoding (forward mapping) and decoding (inverse mapping) can be performed by using a single invertible neural network (INN) [34]. While a monolithic end-to-end system generally offers a large steganographic capacity, a common limitation within this type of framework is imperfect reversibility due to the presence of an information bottleneck (i.e. a form of lossy compression) in neural networks. The lack of transparency, interpretability and explainability in neural networks adds an extra dimension to the problem.

D. REVERSIBILITY-MODULAR FRAMEWORK
Although reversibility is not an all-or-nothing proposition, the unreliability of end-to-end learning may pose uncontrollable risks in certain circumstances. Problem decomposition or modularisation is essential to addressing complex problems. Existing reversible steganographic schemes often consist of coding and analytics modules. In general, the coding module handles the encoding and decoding mechanisms, which are designed to achieve perfect reversibility subject to an imperceptibility constraint. The analytics module models the data distribution and exploits data redundancy in order to optimise steganographic rate-distortion performance. For instance, the regular-singular (RS) scheme introduced by Fridrich et al. constructs a discrimination function to categorise image blocks into regular, singular and unusable groups on the basis of the smoothness prior, and uses invertible flipping of the least significant bits to achieve invisible and erasable embedding [35]. We can define the former part as the analytics module and the latter part as the coding module. The problem of limited reversibility in previous deep-learning approaches lies in the difficulty of lossless coding in an end-to-end learning fashion. A recent study achieves perfect reversibility with a deep-learning-based RS scheme following this established modular framework [36]. A conditional generative adversarial network (GAN), referred to as pix2pix [37], is applied to improve the performance of the discrimination function. It has been shown that by partitioning a scheme into coding and analytics modules and deploying neural networks in the latter, reversibility can be reliably guaranteed.

E. ANALYSIS OF PRIOR ART
A combination of prediction-error modulation with deep learning has been described in two studies [38] and [39]. Neural networks are incorporated into the schemes based upon modularity. The most notable feature in common is that both studies divide pixels into query and context sets by using a chequered pattern and develop neural network models for predicting the query from the context. The main difference lies in the neural network architectures. The former constructs a multiscale convolutional neural network (MS-CNN) to address the issue of restricted receptive fields in traditional predictors. The latter applies a persistent memory network (MemNet) originally developed for image denoising and image super-resolution [40]. From a certain perspective, context-aware pixel intensity prediction is closely related to low-level computer vision tasks such as image denoising and image super-resolution because they all rely largely on low-level (or pixel-level) features such as edges, contours and textures, rather than high-level semantics [41]. Therefore, it is expected that advanced neural networks from the low-level computer vision domain can be applied directly with minor modifications. Another difference is the setting of the initial intensities of the query pixels. The former simply sets the query pixels to zero and learns to predict from scratch. By contrast, the latter initialises them to the mean of neighbouring pixels and performs prediction in a coarse-to-fine manner. Consequently, a problem of pixel-intensity prediction is, in a certain sense, transformed into a problem of image denoising or image super-resolution. Last but not least, the chequered pattern for context/query splitting features a dual-layer prediction: pixels assigned as the query in the first round, after partial embedding, can be assigned as the context in the second round.
Such a practice gives rise to the problem of distributional shift: the data distribution in the target (online) domain for deployment shifts from that in the training (offline) domain. In consequence of the distortion introduced in the preceding steganographic process, assigning those distorted pixels as the context could introduce discrepancy and bias into the contextual information, thereby undermining predictive performance. The expected performance relies on the extrapolation (generalisation) capability of neural networks. To summarise, this work aims to seek answers to the following questions:
• Can different intensity initialisation strategies affect predictive accuracy of neural networks substantially?
• Can the problem of distributional shift be mitigated, thereby enabling efficient dual-layer prediction?
• Can off-the-shelf neural networks from the low-level computer vision domain be transferred to perform pixel intensity prediction?

TABLE 1. Definitions of operations.
net 1: neural network for first-layer prediction
net 2: neural network for second-layer prediction
scale ⇓: downscaling of possibly overflowing pixels
scale ⇑: upscaling of possibly overflowing pixels
split: context-query splitting
merge: context-query merger
encode: prediction-error modulation (payload embedding)
decode: prediction-error de-modulation (payload extraction)
concat: concatenation of bit-streams
deconcat: de-concatenation of bit-streams

III. REVERSIBLE STEGANOGRAPHY
In this section, we delineate a reversible steganographic scheme based on prediction-error modulation. We begin with an overview of the scheme and then dissect each scheme component.

A. SCHEME OVERVIEW
A schematic workflow of the encoding and decoding procedures is provided in Figure 1 with the definitions of variables and operations listed in Table 1. The scheme based on prediction-error modulation can be broken down into an analytics module and a coding module. The former models the distribution of natural images and predicts the intensity of query pixels on the basis of the available context pixels. The latter encodes/decodes messages into/from prediction errors. Consider the following scenario: A sender (encoder) embeds a message $m$ into a cover image $x$ to produce a stego image $x'$ and then transmits it to a receiver (decoder) who extracts the message and restores the cover image on receipt of the stego image. At the encoder side, we pre-process the cover image to prevent pixel intensity overflow during message embedding at the cost of generating extra auxiliary information. We treat the auxiliary information as a part of the payload and concatenate it with the intended message. Then, a chequered pattern is applied to split the image into a black set and a white set of pixels, denoted respectively by $x_B$ and $x_W$. In the first-layer prediction, we designate the black set as the context set and the white set as the query set. The roles are reversed in the second-layer prediction. We make use of a neural network or a predictive algorithm to estimate the intensities of the query pixels based on the observed context pixels. The prediction errors or residuals, denoted by $\varepsilon$, are computed by subtracting the predicted values from the actual values. A portion of the payload is embedded in the residual domain through arithmetic operations. The modulated prediction errors, denoted by $\varepsilon'$, are then added to the predicted values, resulting in a slight distortion to the query pixels. The process can be repeated once more to embed another part of the payload by swapping the roles between the context and the query.
At the decoder side, we carry out message extraction and image restoration in a last-in-first-out manner. Similar to the encoding process, the decoding process begins by splitting pixels of the stego image into the context and the query, predicting the query from the context, and calculating the prediction errors. A partial payload is extracted and the prediction errors are de-modulated back to the original state through inverse operations. Steganographic distortion is removed by adding the restored prediction errors to the predicted values. The decoding process can likewise be repeated once more by swapping roles between the context and query. Finally, the restored image is post-processed with the extracted auxiliary information to undo the minute modifications caused by the overflow prevention measure.
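As a rough illustration of the workflow described above, the following Python sketch runs a dual-layer round trip on a toy image. It is a minimal sketch under stated assumptions rather than the scheme's implementation: the predictor is a rounded mean of the in-bounds 4-neighbours standing in for a neural network, the white set is taken to be the pixels with odd i + j, the coding rule is restricted to the simplest case ϑ = 1, overflow handling is omitted, and an exhausted payload is padded with zeros. All function names are hypothetical.

```python
def sgn(e):
    return (e > 0) - (e < 0)

def modulate(e, bits):
    """Simplified coding rule for threshold 1: only zero errors carry payload."""
    if e != 0:
        return e + sgn(e)                      # shift non-carrier errors by 1
    if next(bits, 0) == 0:                     # zero padding once payload ends
        return 0
    return 1 if next(bits, 0) == 0 else -1     # assumed sign convention

def demodulate(e):
    """Inverse mapping: returns (original error, extracted bits)."""
    if abs(e) >= 2:
        return e - sgn(e), []
    if e == 0:
        return 0, [0]
    return 0, [1, 0 if e == 1 else 1]

def predict(img, i, j):
    """Stand-in predictor: rounded mean of the in-bounds 4-neighbours."""
    h, w = len(img), len(img[0])
    nb = [img[a][b] for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
          if 0 <= a < h and 0 <= b < w]
    return (sum(nb) + len(nb) // 2) // len(nb)

def queries(img, parity):
    return [(i, j) for i in range(len(img)) for j in range(len(img[0]))
            if (i + j) % 2 == parity]

def encode(cover, payload):
    stego = [row[:] for row in cover]
    bits = iter(payload)
    for parity in (1, 0):                      # layer 1: white, layer 2: black
        for i, j in queries(stego, parity):
            y = predict(stego, i, j)
            stego[i][j] = y + modulate(stego[i][j] - y, bits)
    return stego

def decode(stego):
    img = [row[:] for row in stego]
    layers = []
    for parity in (0, 1):                      # last in, first out
        extracted = []
        for i, j in queries(img, parity):
            y = predict(img, i, j)
            e, b = demodulate(img[i][j] - y)
            img[i][j] = y + e
            extracted += b
        layers.append(extracted)
    return img, layers[1] + layers[0]          # first-layer bits come first
```

Because each query pixel is predicted only from pixels of the opposite colour, the decoder reproduces exactly the predictions made by the encoder once it undoes the layers in reverse order, which is what makes the restoration exact in this sketch.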
To summarise, the encoding and decoding phases follow the algorithmic steps described above and depicted in Figure 1. All images are assumed to be 8-bit greyscale throughout this study. Let us denote by ϑ a threshold for the stego channel such that a prediction error $\varepsilon$ belongs to the stego channel if
$$\mathrm{abs}(\varepsilon) < \vartheta,$$
where abs denotes the absolute value. In other words, we use the parameter ϑ to determine for which prediction error values the payload embedding process can take place. The following presents the algorithmic details surrounding the notion of the stego channel.

B. OVERFLOW HANDLING
Both encoding and decoding are carried out in the residual domain. Addition and subtraction on residual values are equivalent to the same arithmetic operations on pixel values (i.e. a pro rata increment or decrement in pixel intensity). Given that computations are defined in a finite (Galois) field, such arithmetic operations may cause pixel intensity overflow: intensity values that are unexpectedly small or large wrap around the minimum or maximum after manipulations. We pre-process the image to prevent pixels from moving off boundary (or becoming saturated) and mark the locations where off-boundary pixels may occur. To distinguish between processed and unprocessed pixels with the same value, we flag 1 (true) for the former and 0 (false) for the latter in an overflow-status register. The parameter ϑ is associated with the modulation step in the message embedding process. A greater value of ϑ spans a wider range of pixel values that may be modified beyond the boundary and thus has a higher probability of increasing the size of the register. The overflow-status register is an overhead of reversibility. It has to be concatenated with the message and embedded as a part of the payload. As a consequence, its size has to be deducted from the overall payload when assessing steganographic capacity. See Algorithms 1 and 2 for the pseudo-codes. The overflow handling process is codified in Table 2.
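The exact pre-processing equations are not reproduced in the text above, so the following Python sketch shows only one plausible realisation of overflow handling with an overflow-status register. It assumes, hypothetically, that pixels within ϑ of either boundary are moved inwards by ϑ, and that a flag is recorded for every pixel whose processed value could have come from either a moved or an unmoved original. The function names and the shifting rule are assumptions made for illustration.

```python
def prevent_overflow(pixels, theta, lo=0, hi=255):
    """Move near-boundary pixels inwards by theta (assumed rule) and
    record one flag per pixel whose processed value is ambiguous."""
    processed, register = [], []
    for x in pixels:
        if x < lo + theta:
            x_new, moved = x + theta, True
        elif x > hi - theta:
            x_new, moved = x - theta, True
        else:
            x_new, moved = x, False
        # Values in these bands may stem from a moved or an unmoved pixel.
        if x_new < lo + 2 * theta or x_new > hi - 2 * theta:
            register.append(1 if moved else 0)
        processed.append(x_new)
    return processed, register

def restore_overflow(processed, register, theta, lo=0, hi=255):
    """Undo the pre-processing by consuming the flags in scan order."""
    flags = iter(register)
    restored = []
    for x in processed:
        if x < lo + 2 * theta:
            x = x - theta if next(flags) else x
        elif x > hi - 2 * theta:
            x = x + theta if next(flags) else x
        restored.append(x)
    return restored
```

Under this assumed rule, every processed pixel lies at least ϑ away from both boundaries, so a change of at most ϑ during embedding cannot wrap around. The register grows with ϑ because the ambiguous bands widen, which matches the trade-off described above.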

C. ENCODING/DECODING
Let $y_{i,j}$ denote a prediction of the pixel $x_{i,j}$ at location $(i, j)$. For each query pixel, we compute its prediction error by
$$\varepsilon_{i,j} = x_{i,j} - y_{i,j}.$$
For the prediction errors determined as the stego channel, we embed a bit of payload information by
$$\varepsilon'_{i,j} = 2\,\varepsilon_{i,j} + \mathrm{sgn}(\varepsilon_{i,j})\, p_t,$$
where sgn extracts the sign of the prediction error (either positive or negative) and $p_t$ is the current payload bit. For the rest of the errors, we shift them by ϑ to prevent error values from overlapping (i.e. becoming indistinguishable from the errors determined as the stego channel); that is,
$$\varepsilon'_{i,j} = \varepsilon_{i,j} + \mathrm{sgn}(\varepsilon_{i,j})\, \vartheta.$$
A modulated error is then added to the predicted value, resulting in a stego pixel
$$x'_{i,j} = y_{i,j} + \varepsilon'_{i,j}.$$
Let us take the case in which ϑ = 2 for example. When $\varepsilon = 0$, we can map it to one of the three states, $\varepsilon' = 0$ or $\pm 1$, to encode a ternary digit (a trit of information). When $\varepsilon = \pm 1$, we shift it to $\pm 2$ to avoid ambiguity and map it to one of the two states, $\varepsilon' = \pm 2$ or $\pm 3$ (disregarding the sign), to represent one bit of information. For all other $\varepsilon$ such that $\varepsilon \notin (-\vartheta, +\vartheta)$, we shift each of them by ϑ in either a positive or negative direction, depending on the sign. However, converting the payload between binary (base 2) and ternary (base 3) numeral systems on the fly can be problematic. To circumvent this issue, we can map the errors with value 0 to the digit 0 if the observed payload bit $p_t$ is $0_2$; otherwise, we read the next payload bit $p_{t+1}$ and map the errors to $+1$ or $-1$ according to its value. Theoretically, the mapping permits one trit or $\log_2 3 \approx 1.585$ bits of information to be embedded. In practice, the compromised solution embeds one or two bits with a probability of 0.5 each, thereby being capable of embedding 1.5 bits on average for $\varepsilon = 0$.
Decoding is simply an inverse mapping. It begins by predicting the query pixel intensities and computing the prediction errors. Each error is de-modulated according to its magnitude. For errors in $[-1, +1]$, we map them back to 0 and extract the corresponding payload bits.
For errors whose magnitude is in (1, 2ϑ) regardless of the sign, we de-modulate them by the floor division of the magnitude by 2 and extract the payload bits by the remainder of the division. For the remaining errors, we shift them back by ϑ. The original pixel intensity is recovered by adding the de-modulated error to the predicted value. According to the Law of Error [42], we can hypothesise that the frequency of an error approximates an exponential function of its magnitude and therefore follows a zero-mean Laplacian distribution. This symmetrical modulation pattern is derived from the fact that prediction errors are expected to centre around 0. Therefore, it is advisable to designate errors with a small absolute magnitude (of high frequency) as the stego channel. A rise in the width of the stego channel represents an increased steganographic capacity. See Algorithms 3 and 4 for the pseudo-codes. The prediction-error modulation process is codified in Table 3. Figure 2 displays stego images, their tonal distributions, and locations of carrier pixels w.r.t. different settings of the stego-channel parameter ϑ. Results are generated by spreading a pseudo-random binary message over the whole image at the maximal steganographic capacity. It can be observed that the quality degradation is virtually imperceptible, the disturbance to the tonal distribution is subtle, and the pixels selected for carrying the payload are mostly clustered in smooth areas. We would like to note that this coding method is simple, readily scalable and computationally efficient but not necessarily optimal. The main theme of this study is the analytics module, which is independent of the coding module. Any coding method that reflects more accurately the statistical distribution of residuals can be applied without conflicting with the findings of this study. For the subject of mathematical optimisation of reversible steganographic coding, the interested reader is referred to the research study [43].
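The modulation and de-modulation rules of this subsection can be sketched for a general threshold ϑ as follows. This is a hedged reading of the prose and the ϑ = 2 example, not a transcription of Algorithms 3 and 4: the function names are hypothetical, the sign assigned to the second bit of a zero error is an assumed convention, and the payload iterator is assumed to hold enough bits.

```python
def sgn(e):
    return (e > 0) - (e < 0)

def modulate(err, theta, bits):
    """Embed payload bits into one prediction error (bits is an iterator)."""
    mag = abs(err)
    if mag >= theta:                       # outside the stego channel: shift
        return err + sgn(err) * theta
    if err == 0:                           # one or two bits, three states
        if next(bits) == 0:
            return 0
        return 1 if next(bits) == 0 else -1    # assumed sign convention
    return sgn(err) * (2 * mag + next(bits))   # one bit per non-zero carrier

def demodulate(err_mod, theta):
    """Return (original error, list of extracted bits)."""
    mag = abs(err_mod)
    if mag >= 2 * theta:                   # shifted error: shift back
        return err_mod - sgn(err_mod) * theta, []
    if err_mod == 0:
        return 0, [0]
    if mag == 1:
        return 0, [1, 0 if err_mod == 1 else 1]
    return sgn(err_mod) * (mag // 2), [mag % 2]
```

For example, with ϑ = 2 the errors [0, 0, 1, -1, 2, -3, 0, 5] and the payload 0111010 modulate to [0, -1, 3, -2, 4, -5, 1, 7], and no error moves by more than ϑ.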

IV. PREDICTIVE ANALYTICS
In this section, we take a closer look at the practice of training neural networks for predictive analytics in reversible steganography. We investigate different initialisation strategies for configuring the inputs of neural networks, different training strategies for fitting neural networks with dual-layer prediction, and different neural network architectures suitable for context-aware pixel intensity prediction.

A. CONTEXT-QUERY SPLITTING
A convenient way to define pixel connectivity is to consider the von Neumann neighbourhood, that is, four adjacent pixels connected horizontally and vertically. If we sample the context and query pixels uniformly in such a way that each query pixel has 4 connected context pixels, we end up forming a pattern analogous to a chequerboard, as illustrated in Figure 3. We can define the black set B as the context and the white set W as the query (or the other way round). A naïve way to estimate the intensity of the central pixel is to calculate the mean of locally connected pixels. This heuristic predictive model, referred to as local-mean interpolation (LMI) [44], is based on the smoothness prior, a generic contextual assumption on real-world photographs.
Recent studies have shown that neural network models (e.g. MS-CNN [38] and MemNet [39]) can be applied to improve predictive accuracy. Because predictive accuracy is instrumental in steganographic rate-distortion performance, the factors that could affect predictive accuracy are the primary concerns of this study.

B. INITIALISATION STRATEGIES
The first step of the encoding phase is to split a cover image into $x_B$ and $x_W$. While the values of the query pixels are supposed to be predicted, the initial values of query pixels $x_Q$ have to be determined and cannot be set as null since the input to the applied neural networks must be a complete image. To this end, we consider two simple initialisation strategies: zero initialisation and local-mean initialisation.

1) Zero initialisation
For zero initialisation, we set the initial value for each query pixel explicitly to 0. This strategy is rather arbitrary but computationally efficient when compared with other initialisation strategies. Apart from this, it involves little subjective knowledge, preliminary analysis, or preconception about the intensities. In other words, it involves minimal human interference in the machine learning process. We may perceive prediction with zero initialisation as a special type of image super-resolution problem in which pixels are downsampled according to a special chequered pattern. The query pixels set to zero are viewed as missing data.

2) Local-mean initialisation
For local-mean initialisation, we assign the mean intensity of 4 connected neighbours as the initial value of each query pixel. This strategy involves the computation of a local mean, but it is often the case that pre-estimation could help to accelerate the training process. We can test whether learning-based models tend to perform better and converge faster when such approximate values are pre-calculated. From another perspective, local-mean initialisation formulates the prediction problem as an image denoising problem by viewing the query pixels as noisy data.
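The two initialisation strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the query set is taken to be the pixels with odd i + j, pixels on the image border use only their in-bounds neighbours, and the local mean is rounded to the nearest integer. None of these conventions is fixed by the text above, and the function name is hypothetical.

```python
def initialise_queries(image, strategy="zero"):
    """Return a complete input image with the query pixels initialised."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(h):
        for j in range(w):
            if (i + j) % 2 != 1:           # context pixel: keep as observed
                continue
            if strategy == "zero":
                out[i][j] = 0
            elif strategy == "local_mean":
                # All 4-connected neighbours of a query pixel are context.
                nb = [image[a][b]
                      for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                      if 0 <= a < h and 0 <= b < w]
                out[i][j] = (sum(nb) + len(nb) // 2) // len(nb)
            else:
                raise ValueError("unknown strategy: " + strategy)
    return out
```

For the 3 × 3 image [[10, 20, 30], [40, 50, 60], [70, 80, 90]], zero initialisation yields [[10, 0, 30], [0, 50, 0], [70, 0, 90]], whereas local-mean initialisation yields [[10, 30, 30], [43, 50, 57], [70, 70, 90]].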

C. TRAINING STRATEGIES
Dual-layer prediction entails the problem of distributional shift because steganographic distortion appears after the first-layer embedding. In the first layer, the intensities of query pixels are predicted by a neural network model and then modified with the prediction-error modulation algorithm. In the second layer, the roles of context and query are switched. As a result, the context set consists of distorted pixels. This can also be viewed as uncertainty propagation in the sense that errors in the previous layer propagate to the next layer, impairing the predictive performance. Formally, we aim to minimise the loss on two test sets, one for each prediction layer. To this end, we explore three training strategies: universal training, independent training, and causal training. The training strategies differ by the configurations of the training set, as illustrated in Figure 4.

1) Universal training
Universal training is a cost-efficient way to manage dual-layer prediction by training a single model for performing prediction in both layers. The motivation is that the context/query switch between the first and second layers can be perceived as a simple geometric translation and convolutional neural networks (CNNs) are considered to be translation-invariant. For that reason, a single model may be sufficient and can generalise well for tackling dual-layer prediction. According to the chequered pattern, the context (or query) coordinates of the first layer, when shifted by one step (either horizontally or vertically), become the context (or query) coordinates of the second layer. Inspired by the translation-invariance property of classic CNN models, we conjecture that a model trained on the first set can be deployed directly to make inferences on the second test set. In other words, we train a single universal model on the first-layer training set and apply it to both test sets.

2) Independent training
Independent training is to train two models in parallel for the respective layers. For inference on the second test set, we structure the training set without taking account of steganographic distortion. This strategy can serve to check whether a single translation-invariant model is adequate. If the conjecture is valid, both would yield a similar result. Although the practice of training two models incurs an extra computational cost, this strategy permits parallel training because the unmodified images rather than stego images are used as the inputs for training the second-layer predictor.

3) Causal training
Causal training is to train the models in a consecutive manner such that the succeeding model is trained after the data has been modified with the preceding model. That is, a predictive model is trained and then deployed for embedding information into the training data for another model. The problem of distributional shift stems from the discrepancy between the training and deployment environments (i.e. distributions of the training and test sets). Reducing the deployment loss requires the distribution of the training set to be as close as possible to that of the test set. A potential way to remedy the problem is to inject steganographic distortion into the second-layer training set. At first glance, it seems unsurprising that the distributional shift can be mitigated by introducing steganographic distortion to the training of the second layer. However, steganographic distortion varies with the embedded message and is quite random (rather than being a fixed noise pattern). It implies that such distortion cannot be simply filtered out by the models. Nevertheless, we hypothesise that neural networks can learn to predict from much more stable and reliable image representations if such distortion is presented during training. Implementing a steganographic algorithm incurs an extra computational cost and the succeeding model can only be trained once the training of the preceding model has been completed. The two models are causally connected as the prediction of the first model contributes to the prediction of the second model. Specifically, the first model is trained on unmodified images, and the second model is trained on images whose first-layer query pixels carry the distortion produced with the first model.
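The difference between the three strategies comes down to how the training example for the model that serves the second layer is assembled. The Python sketch below illustrates this under stated assumptions: query pixels are zero-initialised, the white set is the pixels with odd i + j, and `first_layer_embed` is a hypothetical callback standing for prediction with the first model followed by prediction-error modulation. It is a sketch of the data configuration only and does not train any model.

```python
WHITE, BLACK = 1, 0   # assumed parity labels, i.e. (i + j) % 2

def mask_queries(image, query_parity):
    """Zero-initialise the query pixels and keep the context pixels."""
    return [[0 if (i + j) % 2 == query_parity else v
             for j, v in enumerate(row)]
            for i, row in enumerate(image)]

def second_layer_training_example(cover, strategy, first_layer_embed=None):
    """Return (network input, target image) for the second-layer predictor."""
    if strategy == "universal":
        # One model for both layers, trained on the first-layer configuration.
        return mask_queries(cover, WHITE), cover
    if strategy == "independent":
        # Roles swapped, but the context comes from the unmodified image.
        return mask_queries(cover, BLACK), cover
    if strategy == "causal":
        # Roles swapped, and the context carries first-layer distortion.
        stego = first_layer_embed(cover)
        return mask_queries(stego, BLACK), cover
    raise ValueError("unknown strategy: " + strategy)
```

In the causal case the target remains the cover image because the second-layer query pixels (the black set) are untouched by the first-layer embedding. Only the causal configuration requires the first model to be available before the second training set can be built.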

D. NEURAL NETWORKS
An image whose query pixels are initialised to either zero or local mean can be viewed as a noisy and low-resolution version of the original image. The goal of context-aware pixel-intensity prediction can therefore be considered as refining the observed image into a clean and high-resolution one. Based on this perception, we can adopt advanced deep-learning models originally devised for image denoising and image super-resolution. A neural network is essentially a non-linear function that learns to minimise a pre-defined loss function by transforming the input into useful features or representations in a latent space. For the task of pixel-intensity prediction, we suggest training the models to minimise the $\ell_1$ norm or mean absolute error (MAE). The reason for choosing the $\ell_1$ norm over the $\ell_2$ norm is that the latter tends to produce overly smoothed outputs in image enhancement tasks [45]-[47]. The $\ell_2$ norm, or mean squared error (MSE), encourages producing an average of plausible solutions and is prone to being affected by outliers. There are other preferable loss functions in the field of low-level computer vision. An example is the Euclidean distance between the high-level feature maps extracted by a 19-layer VGG neural network [48] pre-trained on the ImageNet (a database for large-scale visual recognition) [49]. Another example is the adversarial loss which uses a discriminator to estimate the likelihood that a generated instance is real [50]. Nevertheless, we do not anticipate these ad hoc loss functions improving steganographic performance because prediction-error modulation relies mainly on pixel-wise distance. Apart from the choice of loss functions, when applying off-the-shelf models, upsampling and transposed convolutional layers should be replaced with regular convolutional layers because the sizes of the input and output images are the same in pixel-intensity prediction.
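The remark that the mean squared error favours an average of plausible solutions and is sensitive to outliers can be checked with a small numerical example. The Python sketch below is illustrative only: the intensity values are made up, and the function names are hypothetical. It compares a single constant prediction against several plausible target intensities, one of which is an outlier.

```python
from statistics import mean, median

def mae(prediction, targets):
    """Mean absolute error of one constant prediction against the targets."""
    return sum(abs(t - prediction) for t in targets) / len(targets)

def mse(prediction, targets):
    """Mean squared error of one constant prediction against the targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

# Hypothetical plausible intensities for one query pixel; 200 is an outlier.
plausible = [100, 102, 104, 200]

mean_estimate = mean(plausible)      # minimiser of the MSE
median_estimate = median(plausible)  # a minimiser of the MAE

print(mean_estimate, mse(mean_estimate, plausible), mae(mean_estimate, plausible))
print(median_estimate, mse(median_estimate, plausible), mae(median_estimate, plausible))
```

The MSE-optimal estimate (126.5) is dragged towards the outlier and matches none of the plausible intensities closely, whereas the MAE-optimal estimate (103) stays with the majority.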
While neural network architectures vary from one to another and it is difficult to foresee their performance in different tasks (due to the 'black box' nature of neural networks), we attach relative importance to shortcut connections because they mitigate the performance degradation problem when increasing network depth [51] and allow models to learn explicitly the minute differences between the input and output images [52]. The neural network architectures considered in this study are illustrated in Figure 5 and specified as follows.

1) MS-CNN
The MS-CNN extracts multi-scale image features with convolutional kernels of different sizes parallel to one another, to overcome the problem of restricted receptive fields in traditional linear predictors. The multi-scale features are aggregated in two successive convolutional layers to make a final prediction. In the implementation, the multi-scale kernel sizes are configured to 3×3, 5×5, and 7×7 with the number of channels set to 32.
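For the parallel branches to be aggregated, each kernel size has to preserve the spatial size of the feature maps. The following arithmetic sketch checks this for the three kernel sizes above and counts the weights of each branch. It assumes stride 1, zero padding of (k − 1)/2 pixels, a single-channel greyscale input and bias terms; none of these details are stated in the text, and the helper names are illustrative.

```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Padding that preserves the size for odd kernels with stride 1."""
    return (kernel - 1) // 2


def conv_params(kernel, c_in, c_out, bias=True):
    """Number of trainable parameters of a 2-D convolutional layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# Assumed set-up: 256 x 256 greyscale input, 32 output channels per branch.
sizes = {k: conv_output_size(256, k, same_padding(k)) for k in (3, 5, 7)}
params = {k: conv_params(k, c_in=1, c_out=32) for k in (3, 5, 7)}

print(sizes)   # {3: 256, 5: 256, 7: 256}
print(params)  # {3: 320, 5: 832, 7: 1600}
```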

2) MemNet
The MemNet consists of interconnected memory cells, each comprising a recurrent unit and a gate unit. The recurrent connectivity substantially reduces the number of trainable parameters, enabling the formation of a lightweight model for storage. The gating mechanism regulates important latent states or persistent memories, thereby mitigating the vanishing gradient problem often encountered when training deep neural networks. The recurrent unit comprises a series of residual blocks with shared weights, whereas the gating mechanism is a convolutional layer with a kernel size of 1×1. For each memory cell, the outputs from the tightly structured residual blocks (short-term memories) along with the outputs from previous cells pass through a gate unit to attain a persistent state (long-term memory). In the implementation, the number of memory cells and the number of residual blocks in each cell are both configured to 5 with the kernel size set to 3 × 3 and the number of channels set to 64.
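The saving from weight sharing can be estimated with simple arithmetic. The sketch below assumes that each residual block holds two 3 × 3 convolutional layers with 64 input and 64 output channels plus bias terms, and it ignores the gate units and any normalisation layers; these are simplifying assumptions rather than details taken from the text.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Number of trainable parameters of a 2-D convolutional layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


CELLS = 5             # memory cells (from the text)
BLOCKS_PER_CELL = 5   # residual blocks per cell (from the text)
CHANNELS = 64
LAYERS_PER_BLOCK = 2  # assumption: two convolutional layers per block

block = LAYERS_PER_BLOCK * conv_params(3, CHANNELS, CHANNELS)

# With shared weights, each cell stores a single residual block.
shared = CELLS * block
# Without sharing, every block would hold its own weights.
unshared = CELLS * BLOCKS_PER_CELL * block

print(block)              # 73856
print(shared, unshared)   # 369280 1846400
```

Under these assumptions, sharing reduces the parameters of the recurrent units by a factor equal to the number of residual blocks per cell.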

3) RDN
The RDN combines residual connections and dense connections to learn hierarchical image representations. The model exploits hierarchical representations to the fullest by fusing features at both the global and local levels. At the local level, convolutional layers are densely connected, forming a residual dense block with a skip connection between the input and output. A convolutional layer with a kernel size of 1 × 1 is used to fuse the local feature maps. At the global level, the outputs of multiple blocks are, again, blended via a convolutional layer with a kernel size of 1 × 1, followed by a skip connection from a shallow layer to a deep layer. In the implementation, the number of residual dense blocks is configured to 3 and the number of convolutional layers in each block is configured to 5 with the kernel size set to 3 × 3 and the number of channels set to 64.
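The role of the 1 × 1 fusion layers becomes clearer when the channel counts are traced through a block. The sketch below assumes that every dense layer outputs 64 channels (i.e. a growth rate equal to the base channel width) and receives the concatenation of the block input and all preceding layer outputs; the growth rate is an assumption, as the text only states the number of channels.

```python
BLOCKS = 3    # residual dense blocks (from the text)
LAYERS = 5    # convolutional layers per block (from the text)
CHANNELS = 64 # block input/output width (from the text)
GROWTH = 64   # assumption: each dense layer adds 64 channels

# Input width of each densely connected layer within one block.
layer_inputs = [CHANNELS + i * GROWTH for i in range(LAYERS)]

# Local fusion: a 1 x 1 convolution maps the concatenation back to 64 channels.
local_fusion_in = CHANNELS + LAYERS * GROWTH

# Global fusion: a 1 x 1 convolution blends the outputs of all blocks.
global_fusion_in = BLOCKS * CHANNELS

print(layer_inputs)      # [64, 128, 192, 256, 320]
print(local_fusion_in)   # 384
print(global_fusion_in)  # 192
```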

V. EXPERIMENTS
In this section, we report and discuss experimental results w.r.t. different predictive models. We begin by discussing the choice of benchmarks and describing the general set-up of experiments. Then, we evaluate the impacts of different initialisation and training strategies on predictive accuracy. To further evaluate the performance of different models, we analyse the distribution of prediction errors. Finally, we verify a direct correlation between predictive accuracy and steganographic performance by examining the steganographic rate-distortion curves.

A. BENCHMARKS
The main theme of this study is predictive analytics. For a fair comparison, we evaluate the performance of different predictive models based on the same steganographic algorithm (i.e. prediction-error modulation). We do not compare steganographic performance with the end-to-end framework because it approaches the problem from a different standpoint. In general, the end-to-end framework cannot reliably satisfy the requirement for perfect reversibility given the limits of current deep-learning algorithms. For the predictive analytics module, we select LMI as a benchmark model based on heuristics and both MS-CNN [38] and MemNet [39] as state-of-the-art models based on deep learning. All the selected models are specifically designed and applied for the prediction task with the same context-query arrangement (i.e. the chequered pattern). The mentioned deep-learning models have also been shown to outperform other traditional methods for either the same or different context-query arrangements, including nearest-neighbour interpolation [53], median-edge detector [54], gradient-adjusted predictor [55] and bilinear interpolation [56]. In summary, the coding algorithm and the context-query splitting pattern are held constant (as control variables) throughout the investigation to prevent interference with the experimental results. We select LMI, MS-CNN and MemNet as representative benchmarks and carry out a comparative study with the advanced RDN model.

B. EXPERIMENTAL SETUP
The neural network models are trained and tested on the BOSSbase dataset [57], which originated from an academic competition for digital steganography. It comprises a collection of 10,000 greyscale photographs covering a wide variety of subjects and scenes. To fit the models over a broad range of hardware options with reasonable training time, all the images are resized to a resolution of 256 × 256 pixels by using the Lanczos resampling algorithm [58]. The training and test sets are randomly sampled at a ratio of 80/20. Evaluations are made on a set of standard test images from the volume 'miscellaneous' in the USC-SIPI dataset [59]. To investigate predictive performance on challenging samples, we also take Brodatz texture images from the volume 'textures' in the USC-SIPI dataset [60]. We use peak signal-to-noise ratio (PSNR) (expressed in decibels; dB) [61] and structural similarity (SSIM) [62] for perceptual quality assessment. When evaluating the visual quality of predicted images, we take the whole image into account because the bias in the context pixels is almost negligible. The context pixels in each predicted image are almost always the same as those in the original image. We use embedding rate (expressed in bits per pixel; bpp) for steganographic capacity evaluation.
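For reference, the two simpler metrics can be written out as follows. This is a minimal sketch using the standard definitions, assuming 8-bit images stored as nested lists; the function names are illustrative and SSIM is omitted because of its length.

```python
import math


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [p for row in reference for p in row]
    tst = [p for row in test for p in row]
    if len(ref) != len(tst):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, tst)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


def embedding_rate(num_bits, height, width):
    """Embedding rate in bits per pixel (bpp)."""
    return num_bits / (height * width)


original = [[10, 20], [30, 40]]
shifted = [[11, 21], [31, 41]]  # every pixel is off by one level

print(round(psnr(original, shifted), 2))  # 48.13
print(embedding_rate(32768, 256, 256))    # 0.5
```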

C. COMPARISON OF INITIALISATION STRATEGIES
We compare the zero and local-mean initialisation strategies in terms of learning curves and perceptual quality. The learning curves are plotted in Figure 6. The curves represent the training loss over 100 epochs for the input/target pairs (x_B, x). It can be observed that while the convergence rate varies from model to model, zero initialisation reaches a slightly lower loss than local-mean initialisation in the last epoch. A possible explanation is that although setting the pixel intensity to zero seems abrupt and oversimplified, this approach involves minimal human intervention in the machine learning pipeline. By contrast, local-mean initialisation can be viewed as introducing subjective prior knowledge (i.e. the smoothness prior) into the training process. The perceptual quality of the predicted images is reported in Figure 7. From the visual quality measurements on different models, we conclude that there is virtually no difference between the impacts of the two initialisation strategies upon predictive accuracy. While local-mean initialisation provides a rough estimate in advance, its contribution to predictive accuracy is negligible. Since zero initialisation has the virtue of low computational complexity, we adopt it for the remaining experiments.
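The two initialisation strategies can be sketched as follows. The code assumes that query pixels are those whose row and column indices sum to an even number (the opposite assignment is equally valid), that the local mean is taken over the available four-connected neighbours, and that the image is at least 2 × 2 pixels; the function names are illustrative.

```python
def is_query(i, j):
    """Assumed chequered pattern: query pixels sit where i + j is even."""
    return (i + j) % 2 == 0


def zero_init(image):
    """Set every query pixel to zero and keep context pixels unchanged."""
    return [[0 if is_query(i, j) else v for j, v in enumerate(row)]
            for i, row in enumerate(image)]


def local_mean_init(image):
    """Set every query pixel to the rounded mean of its context neighbours."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(h):
        for j in range(w):
            if is_query(i, j):
                # On a chequered pattern, all 4-connected neighbours are context pixels.
                nbrs = [image[y][x]
                        for y, x in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= y < h and 0 <= x < w]
                out[i][j] = round(sum(nbrs) / len(nbrs))
    return out


image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]

print(zero_init(image))        # [[0, 20, 0], [40, 0, 60], [0, 80, 0]]
print(local_mean_init(image))  # [[30, 20, 40], [40, 50, 60], [60, 80, 70]]
```

Zero initialisation is a simple masking operation, whereas local-mean initialisation needs a neighbourhood pass over the image, which is the extra computation referred to above.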

D. COMPARISON OF TRAINING STRATEGIES
Predictive accuracies of the universal, independent, and causal training strategies for dual-layer prediction are depicted in Figure 8. The bars show the average PSNR and SSIM scores between the ground-truth and predicted images w.r.t. different training strategies. Steganographic distortion introduced in the first layer propagates and decreases the predictive accuracy of the second layer. As a result, the second-layer scores are generally lower than the first-layer scores due to the distributional shift. Causal training achieves a comparatively high accuracy for the second-layer prediction, whereas the performance gap between the other two strategies is not distinct. This suggests that causal training, which uses a training set with a distribution similar to that of the test set, can indeed alleviate the distributional shift to some extent. The narrow performance gap between universal and independent training can be attributed to the translation-invariance property of CNNs. It suggests that training a single model can be as good as training two separate models. Although causal training requires additional computations when constructing the second training set, it proves to be the most effective among the three training strategies. Hence, we opt for the causal training strategy for the remaining experiments.
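The data flow of causal training can be summarised in a few lines. The sketch below shows only the orchestration: `train` and `embed` are hypothetical callables standing in for a full training routine and a steganographic embedding routine, and the stand-ins at the bottom exist purely to make the example runnable.

```python
def train_dual_layer_causally(covers, train, embed):
    """Train two predictive models in sequence.

    train(images, layer) -> model       (hypothetical training routine)
    embed(model, image)  -> stego image (hypothetical first-layer embedding)
    """
    # The first-layer model is trained on undistorted cover images.
    model_1 = train(covers, layer=1)
    # The first-layer model is deployed to embed messages into the training
    # images, so the second training set carries steganographic distortion.
    distorted = [embed(model_1, image) for image in covers]
    # The second-layer model is trained on the distorted images, which
    # resemble what it will observe at deployment.
    model_2 = train(distorted, layer=2)
    return model_1, model_2


# Stand-in callables that merely record what they receive.
training_log = []


def stub_train(images, layer):
    training_log.append((layer, list(images)))
    return f"model_{layer}"


def stub_embed(model, image):
    return f"{image}+distortion({model})"


models = train_dual_layer_causally(["img_a", "img_b"], stub_train, stub_embed)
print(models)  # ('model_1', 'model_2')
```

The sequential dependency is visible in the code: the second call to `train` cannot start until the first model exists and the embedding pass has finished.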

E. EVALUATION OF VISUAL QUALITY
Visual comparisons between a heuristic model and a neural network model are shown in Figures 9 and 10. We evaluate the PSNR scores for images predicted by the LMI and the RDN. The reported scores relate to the first-layer prediction. A close inspection of the zoomed-in views reveals that the RDN is better able to retrieve textural areas with reasonable accuracy in comparison with the naïve LMI, owing to the ability of neural networks to learn rich patterns. Figure 11 compares the predictive accuracy of the RDN, MemNet, MS-CNN and LMI on texture images. The results reflect a significant improvement yielded by the RDN model over the other predictive models, confirming the capability of the former to predict textural patterns.

F. EVALUATION OF DISTRIBUTION OF ERRORS
The distribution of prediction errors plays a pivotal role in steganographic performance. Normally, errors would follow a zero-centred Laplacian distribution and a more accurate model results in a more concentrated distribution. According to the coding algorithm, a more concentrated distribution can lead to a better steganographic rate-distortion trade-off.
To analyse the error distribution, we examine the probability distribution function (PDF), cumulative distribution function (CDF), and Lorenz curve of errors, as shown in Figures 12, 13, and 14, respectively. We use the entropy, variance, percentile statistics, and Gini coefficient to measure the degree of error concentration [63]. Shannon's entropy is maximised for a uniform distribution and the converse is equally true: a smaller entropy means a more concentrated distribution [13]. The variance is a measure of the spread of the data around the mean. The 95th percentile indicates the maximum error magnitude below which 95% of errors fall. The Gini coefficient is a measure of statistical dispersion intended to represent the inequality of the error magnitude within an image. A coefficient of 0 expresses perfect equality (dispersion), whereas a coefficient of 1 corresponds to maximal inequality (concentration). For the overall degree of concentration, the RDN ranked in the first tier, followed by the MemNet and the MS-CNN in the second tier, with the LMI ranked last.
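The four concentration measures can be computed as in the sketch below. It assumes integer-valued prediction errors, a nearest-rank percentile taken over error magnitudes, and the standard sorted-sample formula for the Gini coefficient; the exact estimators used in the study may differ.

```python
import math
from collections import Counter


def entropy(errors):
    """Shannon entropy (in bits) of the empirical error distribution."""
    n = len(errors)
    return -sum(c / n * math.log2(c / n) for c in Counter(errors).values())


def variance(errors):
    """Population variance of the errors."""
    mu = sum(errors) / len(errors)
    return sum((e - mu) ** 2 for e in errors) / len(errors)


def percentile_magnitude(errors, q=95):
    """Nearest-rank q-th percentile of the error magnitudes (q is an integer)."""
    mags = sorted(abs(e) for e in errors)
    rank = max(1, (q * len(mags) + 99) // 100)  # integer ceiling of q*n/100
    return mags[rank - 1]


def gini(errors):
    """Gini coefficient of the error magnitudes (0 means all magnitudes equal)."""
    mags = sorted(abs(e) for e in errors)
    n, total = len(mags), sum(mags)
    if total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(mags, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


concentrated = [0, 0, 0, 1]   # most errors are zero
even = [1, -1, 1, -1]         # all magnitudes are equal

print(gini(concentrated), gini(even))  # 0.75 0.0
print(entropy([0, 0, 1, 1]))           # 1.0
```

For a finite sample of n errors, this estimator of the Gini coefficient is bounded above by (n − 1)/n, which approaches 1 as n grows.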

G. EVALUATION OF RATE-DISTORTION PERFORMANCE
The steganographic rate-distortion curves are plotted in Figure 15. The models with a higher predictive accuracy indeed achieve a better rate-distortion trade-off. The perceptual quality of stego images decreases as the embedding rate increases since distortions accumulate along the embedding process. The maximum capacity depends on the image content. Images with highly textured details would have lower capacity. A rise of ϑ increases the maximum embedding rate at the cost of compromising the rate-distortion trade-off. Amongst all models, the RDN stands out, achieving state-of-the-art steganographic performance. Apart from the ℓ1 loss, we investigate the applicability of other common loss functions in the field of image restoration. We implement the ℓ2 loss and the multi-loss on the RDN model. The latter comprises the ℓ1 loss, feature-space loss and adversarial loss. While an extra computational cost is devoted to the implementation of the multi-loss training, the results suggest that there is marginal, if any, gain in steganographic performance from applying such a loss function. This reinforces the argument that the prediction-error modulation algorithm relies primarily on pixel-wise distance.
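The effect of the threshold ϑ on capacity can be illustrated with a generic threshold-based coding rule. The sketch below uses classic prediction-error expansion with a threshold as a stand-in; it is an assumption made for illustration and is not claimed to be the exact coding rule of this study. Errors inside [−ϑ, ϑ) carry one bit each, whereas the remaining errors are shifted outwards so that decoding stays unambiguous. Pixel overflow handling is omitted.

```python
def embed_error(e, bit, theta):
    """Modulate a prediction error; only errors in [-theta, theta) carry a bit."""
    if -theta <= e < theta:
        return 2 * e + bit  # carrier: expand and append the bit
    return e + theta if e >= theta else e - theta  # non-carrier: shift outwards


def extract_error(e_mod, theta):
    """Recover (original error, bit); bit is None for non-carriers."""
    if -2 * theta <= e_mod < 2 * theta:
        return e_mod // 2, e_mod % 2  # floor division also handles negatives
    if e_mod >= 2 * theta:
        return e_mod - theta, None
    return e_mod + theta, None


def capacity(errors, theta):
    """Number of bits that can be embedded for a given threshold."""
    return sum(1 for e in errors if -theta <= e < theta)


errors = [0, 0, 1, -1, 3, -4]
print(capacity(errors, 1), capacity(errors, 2))  # 3 4

# Reversibility check on one carrier and one non-carrier.
print(extract_error(embed_error(-1, 1, theta=2), theta=2))  # (-1, 1)
print(extract_error(embed_error(3, 0, theta=2), theta=2))   # (3, None)
```

Under this rule, raising the threshold turns more errors into carriers but also shifts every non-carrier by a larger amount, which is consistent with the trade-off described above. A more concentrated error distribution places more errors inside the carrier range for the same threshold.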

VI. LIMITATIONS OF THE STUDY
While deep learning has revolutionised the research field of reversible steganography and led to major technological breakthroughs in terms of capacity and imperceptibility, novel application scenarios may be offered if secrecy and robustness are taken into further consideration. The current use of neural networks in the modular framework is confined to predictive analytics. By relaxing the constraint on perfect reversibility, it might be possible to apply the end-to-end framework to automatically learn to create stego objects that are undetectable under steganalysis tools and robust against common multimedia processing operations. Apropos of predictive analytics, an important aspect, apart from predictive accuracy, is predictive uncertainty. If the uncertainty about predictions can be estimated, the rate-distortion performance may be further improved by selecting pixels which are predicted with high confidence.

VII. CONCLUSION
In this work, we have discussed unexplored issues such as the impact of intensity initialisation on predictive accuracy and the influence of distributional shift in dual-layer prediction.
Experimental results have revealed that setting pixel intensity to zero, albeit seemingly arbitrary, renders a steadily low loss over several epochs. In addition, it has been found that training models in a causal way can, to some extent, ameliorate the distributional shift in deployment since it minimises the discrepancy in the distributions of training and test sets. The state-of-the-art predictive accuracy and steganographic rate-distortion performance can be achieved by applying advanced pixel-level computer vision models. We envision a promising paradigm shift in reversible steganography ushered in by deep learning and hope that this paper can prove instructive for future research.

CHANG-TSUN LI received the BSc degree in Electrical Engineering from the National Defence University, Taiwan, the MSc degree in Computer Science from the US Naval Postgraduate School, USA, and the PhD degree in Computer Science from the University of Warwick, UK. He is currently a Professor of Cyber Security at Deakin University, Australia. He has had over 20 years research experience in multimedia forensics and security, biometrics, machine learning, data analytics, computer vision, image processing, pattern recognition, bioinformatics and content-based image retrieval. The outcomes of his research have been translated into award-winning commercial products protected by a series of international patents and have been used by a number of law enforcement agencies, national security institutions and companies around the world, including INTERPOL (