Reversible Linguistic Steganography With Bayesian Masked Language Modeling

Text authentication serves a vital role in the defense of digital identity and content against various types of cybercrime. The use of a digital signature is a common cryptographic technique for text authentication. Linguistic steganography can be applied to further conceal a digital signature within the corresponding text to facilitate data management. However, steganographic distortion lurking in the text, albeit almost imperceptible, has the potential to cause automatic computing machinery to make biased decisions. This has led to an interest in the pursuit of reversibility, the ability to reverse a steganographic process and remove distortion. In this article, we propose a reversible steganographic system for natural language text. We use a pre-trained transformer neural network for masked language modeling and embed messages in a reversible manner via predictive word substitution. Furthermore, we derive an adaptive steganographic route by taking account of predictive uncertainty, which is quantified based on a theoretical framework of Bayesian deep learning. Experimental results show that the proposed steganographic system can attain a proper balance between capacity, imperceptibility, and reversibility with close semantic and sentimental similarities between cover and stego texts.


I. INTRODUCTION
Authentication is an essential part of cyber security. It is the process of validating the identity of users and the integrity of digital content. As cyber space continues to expand in scope and scale, authentication plays an important role in maintaining trust against various types of deception, including, but not limited to, impersonating identities, spreading spam, disseminating fake news, sending malicious links, and tampering with digital media. A digital signature is a mathematical proof of authenticity using modern cryptographic techniques such as encryption and hashing [1]. The incorporation of a timestamp and other tamper-evident designs can further strengthen security. Such auxiliary data, however, carry the risk of accidental loss and mismanagement during storage, transmission, or format transformation.
Steganography is the practice of concealing information (e.g. a secret message, a copyright mark, or a serial number) within a carrier object [2]-[4]. Steganography has been applied to covert communications [5], ownership identification [6], broadcast monitoring [7], and traitor tracing [8]. It can also serve as an authentication solution by embedding auxiliary metadata in a digital file, thereby mitigating the risk of losing data. While steganographic distortion is often imperceptible to human sensory systems, it can bias decisions of automatic computing machinery. Previous studies of adversarial attacks have reported that machine learning models can be vulnerable to imperceptible perturbations [9]-[11]. Such distortion might also be inadmissible in some sensitive circumstances such as forensic science, legal proceedings, medical diagnosis, and military reconnaissance. Reversibility is the key to removing steganographic distortion. Reversible computing describes the notion that a computational process can to some extent be time-reversible. Most reversible steganographic methods are designed for digital imagery [12]-[19], whereas methods for textual data remain relatively undeveloped. A possible explanation is that many well-developed tools are available for exploiting the redundancies in visual signals on which steganographic methods rely. In contrast, manipulating natural language texts can be much more challenging, considering that even the tiniest change in a character can be discernible to a careful reader. With the worldwide popularization of social media and the technological advances in natural language processing (NLP), textual data have become an important source of information. Hence, reversible steganography for textual data has emerged as a promising research field.
Typographical steganography is a methodology that embeds messages by manipulating the typeface, spacing, font size, or other typographical characteristics of texts [20]-[24]. It treats text documents as a special type of imagery and is usually applied to printed copies. However, portability and robustness are restricted because this class of method is unable to withstand retyping and font changes. Linguistic steganography deals instead with natural language per se and is thereby able to resist such text processing. It exploits linguistic knowledge and conceals messages by modifying linguistic properties, ideally without altering the sentence semantics or degrading sentence fluency. Linguistic steganography can be broadly categorized into the following classes: lexical class, syntactic class, and generative class. A typical lexical method is synonym substitution, which uses different synonyms of a word to represent different message digits [25]-[27]. In principle, words belonging to the same synonym set have similar meanings and thus can be substituted for each other without causing notable semantic change. Syntactic methods are based on the fact that a sentence can be transformed into semantically equivalent syntactic structures such as active-to-passive voice conversion, subject-verb inversion, and topicalization [28]-[31]. Generative methods consider further that a sentence can be translated into a wide spectrum of forms or completely rewritten subject to minimal change to the semantics. A feasible approach is to use a set of machine translators to generate multiple translations of a given sentence [32]-[35]. Controllable natural language generation has also emerged as a promising direction for generative-level linguistic steganography [36]-[40]. Although these linguistic methods find their applications in areas such as secret communications and ownership identification, none of them meets the reversibility requirement for distortion-free text authentication.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this article, we study reversible linguistic steganography. We apply a masked language model to generate a list of predictive words (Section II) and embed messages via predictive word substitution (Section III). The underlying principle is that most words can be retrieved within a finite number of predictive words, and this type of redundancy can be leveraged for reversibility. The performance of the steganographic system is associated with the accuracy of the predictive model. We further observe that the carrier words can be selected more efficiently by exploiting predictive uncertainty (Section IV). To determine an efficient route for message embedding, we study uncertainty quantification based on a theoretical framework of Bayesian deep learning (Section V).
The remainder of this article is organized as follows. Section II provides a brief overview of masked language modeling. Section III introduces a practical method of reversible linguistic steganography. Section IV offers a further discussion on steganographic routing. Section V presents Bayesian uncertainty quantification. Section VI shows the experimental results for the proposed systems. Concluding remarks are offered in Section VII.

II. MASKED LANGUAGE MODELING
Masked language modeling is a fill-in-the-blank task (a.k.a. a cloze test), where a model attempts to predict a masked word from the surrounding context words. Tokenization is a preliminary step whereby a text sequence is split into tokens (e.g. words, punctuation marks, and numbers). A masked language model takes an input sequence that contains one or more mask tokens and estimates the probability distribution of each mask token over the entire vocabulary.
Given a sequence of tokens, a fundamental issue across nearly all NLP tasks is how to represent tokens as numerical input in a computational model. The arguably simplest representation is the one-hot vector. It is a binary vector of V elements, with all elements set to 0 except the element at the index of the corresponding word in a dictionary, where V is the size of the vocabulary. This type of denotational representation is sparse and treats each word as a completely independent entity; hence, it cannot capture latent word semantics. In linguistics, the distributional hypothesis suggests that words tend to have similar meanings if occurring in similar contexts [41]. This gives rise to the notion of distributional representation. For a given corpus, a co-occurrence matrix of size V × V is computed by counting the number of times each word appears within a context window around the word of interest. Words with similar co-occurrence patterns are expected to have similar meanings. However, the matrix is of high sparsity due to the curse of dimensionality. While singular value decomposition (SVD) can be applied for dimensionality reduction, it has a disadvantage in terms of scalability. This algorithm is computationally expensive for large matrices and has to be performed again from scratch if the co-occurrence matrix is updated by incorporating new terms or corpora. A modern approach toward representation learning is to train a neural network model on a certain task (e.g. language modeling) with each word vector initialized randomly, and then update word vectors iteratively using a stochastic optimization algorithm (e.g. gradient descent). A common limitation of early word vectorization models (e.g. word2vec [42] and GloVe [43]) is their inapplicability to polysemy and homonymy. Multiple meanings of a word are conflated into a single indistinguishable representation regardless of the context within which the word appears.
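To make the two classical representations above concrete, the following is a minimal sketch in plain Python. The function names, the toy sentence, and the window size are illustrative assumptions and are not part of the proposed system.

```python
from collections import Counter


def one_hot(word, vocabulary):
    """Binary vector of length V with a single 1 at the word's dictionary index."""
    return [1 if w == word else 0 for w in vocabulary]


def cooccurrence_counts(tokens, window=1):
    """Count how often each (word, context word) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "the cat sat on the mat".split()
vocabulary = sorted(set(tokens))      # a toy dictionary of V = 5 words
counts = cooccurrence_counts(tokens, window=1)
```

Storing only the non-zero pairs in a `Counter` sidesteps the sparsity of the full V × V matrix for this toy case; it does not address the scalability issue of SVD discussed above.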
Contextual word representation is an advanced concept in which each word vector can be adapted dynamically to its context [44]. The BERT model (standing for bidirectional encoder representations from transformers) is a state-of-the-art neural network developed by Google for learning contextual word representation [45]. The model is based on the transformer architecture, which comprises multiple layers of self-attention modules and processes an input sequence bidirectionally [46]. To learn contextual word representation, the model is trained to perform masked language modeling. This task forces the model to learn more about contextual information. The knowledge learned from masked language modeling can then be transferred to tackle other downstream NLP tasks (e.g. machine translation, sentiment analysis, topic categorization). This is done by fine-tuning: choosing relevant layers in the pre-trained model, adding task-specific layers on top of the model, and training the model on new data. There are a variety of ways to use a pre-trained BERT model, and transfer learning is still a matter of ongoing research. Nevertheless, our steganographic system relies only on a basic function of the BERT model, namely, masked language modeling. The architectural details of the BERT masked language model are illustrated in Fig. 1.

III. REVERSIBLE LINGUISTIC STEGANOGRAPHY
Reversible linguistic steganography considers the following scenario. A sender wishes to communicate a message to a receiver. The message is concealed in a carrier sequence of text, causing recoverable distortions to the carrier sequence. We refer to the original sequence as cover and its distorted counterpart as stego. The message varies with specific applications and is generally assumed to be a random binary sequence. The objective is to assure the accuracy of message extraction and text recovery, while keeping steganographic distortion as low as possible. The proposed method is based on masked language modeling, as depicted in Figure 2. To begin with, we mask a cover word in a text sequence and form a masked sub-sequence by taking the mask token and a fixed number of context words on the left-hand and right-hand sides. The masked sub-sequence is then fed into a pre-trained language model to obtain the probability distribution of the masked cover word. We sort the probabilities in descending order and derive a list of predictive words. The core idea is to substitute the cover word with a predictive word (or, more precisely, map the index of the cover word to another index) to represent a message digit.
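The masking and ranking steps described above can be sketched as follows. This is only an illustration: the probabilities are hand-written stand-ins for the output of a masked language model, and the helper names and the alphabetical tie-breaking rule are assumptions introduced here.

```python
def masked_window(tokens, position, context=2, mask="[MASK]"):
    """Replace the word at `position` by a mask token and keep `context` words on each side."""
    left = tokens[max(0, position - context):position]
    right = tokens[position + 1:position + 1 + context]
    return left + [mask] + right


def predictive_list(probabilities):
    """Sort candidate words by descending probability (ties broken alphabetically)."""
    return sorted(probabilities, key=lambda w: (-probabilities[w], w))


tokens = "alice was beginning to get very tired".split()
window = masked_window(tokens, position=3, context=2)

# Hypothetical model output for the masked position (not real BERT scores).
probabilities = {"to": 0.60, "and": 0.25, "not": 0.10, "so": 0.05}
ranked = predictive_list(probabilities)
x = ranked.index("to")   # index of the cover word in the predictive list
```

A deterministic tie-breaking rule is used in this sketch so that two parties running the same ranking on the same probabilities obtain the same list.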
Let us denote by x the index of the cover word in the list, where $x \in \mathbb{N}_0$ is a non-negative integer. We set a bound for the indices such that x must be less than the bound. If not, we skip the current word and proceed to the next cover word. We further set a threshold θ to separate the carrier and non-carrier indices. The carrier indices constitute a finite set of size θ. To achieve reversibility, encoding must be a bijective function (i.e. a one-to-one correspondence). The number of unique combinations between all possible carrier indices and message digits is given by the Cartesian product of the two sets. Since all possible message digits form a binary set, message embedding doubles the size of the carrier set and causes an overlap between carrier and non-carrier indices. To avoid ambiguity in message extraction, the non-carrier indices have to be shifted outward. Index shifting can induce out-of-bound indices. The set of out-of-bound indices is also of size θ. These indices are kept unshifted, and one flag bit is required to distinguish between a shifted and unshifted non-carrier index in the ambiguity interval [(bound − θ), (bound − 1)]. The flag bits are regarded as an overhead of reversibility. At first glance, it seems that index shifting only acts to transfer the ambiguity interval from one end to another and serves little if any purpose. Yet, the overhead size is reduced substantially because the indices which are close to the bound rarely occur. Given a functional predictive model, we may reasonably assume that the frequency of an index value follows an exponential distribution (rather than a uniform distribution); that is, a smaller index occurs more frequently and vice versa. An illustration of carrier interval, non-carrier interval, and ambiguity interval with different θ settings is shown in Fig. 3. It can be seen that the maximum value of θ is equal to one-third of the bound. When θ is greater than this value, the carrier interval overlaps with the ambiguity interval and there is no point in embedding one message bit at the cost of recording one flag bit.
A practical coding method is presented as follows. Based on the assumption about index frequency, we allocate a smaller amount of distortion to a smaller carrier index when embedding a digit. Let m be a binary message digit to be embedded and let x' denote the stego index. If x is within θ, we encode x and m into a stego index; otherwise, we shift x by θ:

$$x' = \begin{cases} 2x + m, & \text{if } x < \theta, \\ x + \theta, & \text{otherwise.} \end{cases}$$

If the stego index is out of bound, we reset it to its original value and record the case by a flag bit, which marks whether an index in the ambiguity interval has been shifted or left unshifted. Decoding is operated in a first-in last-out manner (i.e. in reverse order of encoding). In the decoding phase, a message bit is extracted by

$$m = x' \bmod 2, \quad \text{if } x' < 2\theta,$$

and an index is recovered by

$$x = \begin{cases} \lfloor x'/2 \rfloor, & \text{if } x' < 2\theta, \\ x' - \theta, & \text{otherwise.} \end{cases}$$

If x' is in the ambiguity interval, we read a flag bit to determine its original value. Pseudo-codes for the encoding and decoding procedures are provided in Algorithms 1 and 2.
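The coding scheme can be illustrated with a short sketch that operates directly on word indices. It is not a transcription of Algorithms 1 and 2. It assumes that a carrier index x < θ is mapped to 2x + m, that other indices are shifted by θ, that a flag value of 0 means shifted and 1 means unshifted, and that the caller supplies exactly one message bit per carrier index. The function names are hypothetical.

```python
def encode(indices, bits, theta, bound):
    """Embed message bits into word indices; returns (stego_indices, flags)."""
    if not (1 <= theta and 3 * theta <= bound):
        raise ValueError("theta must satisfy 1 <= theta <= bound / 3")
    bit_iter = iter(bits)
    stego, flags = [], []
    for x in indices:
        if x >= bound:                  # beyond the bound: word is skipped unchanged
            stego.append(x)
        elif x < theta:                 # carrier index: expand and embed one bit
            stego.append(2 * x + next(bit_iter))
        elif x + theta < bound:         # non-carrier index: shift outward
            y = x + theta
            stego.append(y)
            if y >= bound - theta:      # landed in the ambiguity interval
                flags.append(0)         # 0 = shifted (assumed convention)
        else:                           # shifting would exceed the bound
            stego.append(x)
            flags.append(1)             # 1 = unshifted (assumed convention)
    return stego, flags


def decode(stego, flags, theta, bound):
    """Recover (cover_indices, bits) by walking the stego indices in reverse."""
    flags = list(flags)
    indices, bits = [], []
    for y in reversed(stego):           # first-in last-out
        if y >= bound:
            indices.append(y)
        elif y < 2 * theta:             # expanded carrier index
            bits.append(y % 2)
            indices.append(y // 2)
        elif y >= bound - theta and flags.pop() == 1:
            indices.append(y)           # ambiguous index that was left unshifted
        else:
            indices.append(y - theta)   # undo the shift
    return indices[::-1], bits[::-1]


cover = [0, 1, 5, 7, 3, 12, 1]
message = [1, 0, 1]
stego, flags = encode(cover, message, theta=2, bound=9)
recovered, extracted = decode(stego, flags, theta=2, bound=9)
```

In this toy run only the two indices that end up in the ambiguity interval [7, 8] cost a flag bit, which mirrors the argument that the overhead stays small when indices near the bound are rare.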

IV. STEGANOGRAPHIC ROUTING
Steganographic routing is the process of selecting a path for embedding a payload. Different paths can lead to different trade-off curves between capacity and distortion. Routing is particularly important in the case of a limited payload size because an optimal path can minimize distortion subject to a given payload constraint. There are basically two types of routing: static routing and dynamic routing. Static routing uses a default or manually configured path pre-shared between the encoder and the decoder. It is easy to implement, but cannot minimize distortion. Dynamic routing, on the other hand, constructs an adaptive path that reflects the degree of distortion caused by modifying each word. Recall that our coding design introduces a smaller degree of distortion to a smaller index and greater distortion to a larger index; that is to say, the degree of distortion is inversely proportional to predictive accuracy. Therefore, optimal routing is to select a path in descending order of predictive accuracy. While an optimal path is computable at the encoder, it cannot be reproduced at the decoder. The reason for this is that the word sequence used to derive the path is inconsistent with the word sequence received. A path is represented by a long sequence of digits, and storing such auxiliary information would be impractical. The problem of designing a dynamic path that is computable for both the encoder and the decoder is an intriguing one. The most straightforward way to deal with textual data of a sequential nature is sequential routing. However, steganographic distortion imposed upon the preceding words would propagate, thereby impairing predictive performance on succeeding words. In other words, the contextual clues from the past are distorted and only the clues from the future remain intact. Introducing randomness is conducive to mitigating error propagation. Words are randomly selected for carrying the payload, thereby reducing the chance of encountering distorted context words. Furthermore, a random seed for initializing a pseudorandom number generator can serve as a secret key to enhance security. Nevertheless, the random variation on sequential routing is still a form of static routing and is not necessarily optimal. As previously mentioned, the optimal path is associated with predictive accuracy. The encoder and the decoder cannot derive the same optimal path because predictive accuracy is self-dependent: the accuracy of predicting a target word is related to the word itself. The target word must be kept unchanged to derive the same quantity; however, it has to be changed to carry information. These two objectives are mutually incompatible.
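As a rough illustration of the seeded random route, the sketch below derives a visiting order from a shared integer seed. The function name and the use of Python's `random.Random` are assumptions made for this example; a deployed system would need a generator whose output is guaranteed to be identical at both ends.

```python
import random


def random_route(num_words, seed):
    """Visit order over word positions, reproducible from a shared seed."""
    order = list(range(num_words))
    random.Random(seed).shuffle(order)   # private generator, global state untouched
    return order


encoder_route = random_route(10, seed=2024)
decoder_route = random_route(10, seed=2024)
```

Because the order depends only on the number of words and the seed, it stays the same no matter which words are substituted, which is why this remains a static route.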
Predictive uncertainty is a concept closely related to predictive accuracy and depends purely on the context words. We can derive an alternative path in ascending order of uncertainty, assuming uncertainty is quantifiable. The synchronicity between the encoder and the decoder has to be ensured so that they can compute the same degree of uncertainty for each target word. For that reason, the context words have to be kept unchanged. This can be implemented by sampling target words at fixed intervals of context words. For instance, the context/target words are arranged in the following manner: a target word, a segment of context words, a target word, a segment of context words, and so forth. If the intended payload is beyond the capacity offered by the selected target words, we can select another set of target words by simply shifting the intervals, resulting in multi-level message embedding. The maximum number of levels is equal to the length of the context segment plus one. For each level, we can construct an adaptive path in ascending order of uncertainty. We refer to this method as parallel routing in the sense that a route is dynamically computed in each parallel level. Randomness can be introduced by randomly selecting a context/target pattern for each level. A variety of routing methods are illustrated in Fig. 4. Furthermore, we can set up an empirical threshold τ for filtering out words of high uncertainty to improve performance. In other words, when the prediction of a word is perceived to be highly uncertain, we keep the word completely intact to minimize distortion.
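The interval-based selection and the uncertainty ordering described above can be sketched as follows. The uncertainty scores are made-up numbers, the function names are hypothetical, and treating τ as an inclusive upper limit is an assumption. In a real system the scores of each level would be computed from the text as it stands when that level is processed.

```python
def level_targets(num_words, segment, level):
    """Target positions of one level: every (segment + 1)-th word, starting at `level`."""
    return list(range(level, num_words, segment + 1))


def route_for_level(uncertainty, segment, level, tau=float("inf")):
    """Targets of a level in ascending order of uncertainty, dropping those above tau."""
    targets = level_targets(len(uncertainty), segment, level)
    kept = [p for p in targets if uncertainty[p] <= tau]
    return sorted(kept, key=lambda p: (uncertainty[p], p))


# Hypothetical per-word uncertainty scores for a 7-word text.
uncertainty = [0.9, 0.1, 0.5, 0.2, 0.3, 0.4, 0.05]
segment = 2                                   # context words between two targets
levels = [level_targets(len(uncertainty), segment, k) for k in range(segment + 1)]
route = route_for_level(uncertainty, segment, level=0)
filtered = route_for_level(uncertainty, segment, level=0, tau=0.5)
```

With a context segment of length 2 there are three levels, in line with the statement that the maximum number of levels equals the segment length plus one.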

V. BAYESIAN UNCERTAINTY QUANTIFICATION
Most deep learning models are deterministic functions which offer only predictions without uncertainty information. Bayesian statistics offers a probabilistic interpretation of deep learning models from which the underlying uncertainty can be captured [47]. For a given masked sequence s and a training set D, the predictive distribution of the masked word y is given by

$$p(y \mid s, D) = \int p(y \mid s, \omega)\, p(\omega \mid D)\, d\omega,$$

where ω denotes the model parameters. This can be interpreted as the average prediction over all plausible parameter settings according to the parameter posterior. For deep learning models, the derivation of the parameter posterior is analytically intractable; hence, we resort to variational inference to approximate the posterior by a variational distribution q(ω), which belongs to a family of distributions of simpler form. By substituting the parameter posterior with the variational distribution and approximating the integral with Monte Carlo integration, we derive that

$$p(y \mid s, D) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid s, \hat{\omega}_t),$$

where $\hat{\omega}_t \sim q(\omega)$. Sampling model parameters from a variational distribution can be interpreted as dropout [48], which is a stochastic process of multiplying the output of each neurone by a random variable drawn from a Bernoulli distribution. Each dropout configuration corresponds to a plausible realization of a deep learning model with a portion of neurones deactivated. Applying T different dropout masks to the model is equivalent to performing stochastic forward passes for T repetitions, resulting in an ensemble of sparse neural network models. This process can be viewed as a proxy for a probabilistic deep learning model and is referred to as Monte Carlo dropout [49]. The predictive distribution is derived by averaging the likelihoods from T stochastic forward passes for each word

$$p(y = w_i \mid s, D) \approx \frac{1}{T} \sum_{t=1}^{T} p(y = w_i \mid s, \hat{\omega}_t),$$

where $w_i$ denotes the i-th word in the dictionary. For masked language modeling, the likelihood is given by the normalized exponential function or softmax function

$$p(y = w_i \mid s, \hat{\omega}_t) = \frac{\exp(f_i)}{\sum_{j} \exp(f_j)},$$

where $f_i$ denotes the i-th logit (i.e. raw prediction) from the model.
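The averaging over stochastic forward passes can be illustrated with a toy single-layer model in plain Python. Everything here is an assumption made for illustration: the weights and features are arbitrary numbers, dropout is applied to the input features with inverted scaling, and the function names are hypothetical. It is not the BERT-based analyzer used in the experiments.

```python
import math
import random


def softmax(logits):
    """Normalized exponential function, shifted by the maximum for numerical stability."""
    peak = max(logits)
    exps = [math.exp(f - peak) for f in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_logits(features, weights, keep_prob, rng):
    """One forward pass with a fresh Bernoulli dropout mask on the input features."""
    mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in features]
    dropped = [f * k / keep_prob for f, k in zip(features, mask)]
    return [sum(w * d for w, d in zip(row, dropped)) for row in weights]


def mc_dropout_predict(features, weights, passes=100, keep_prob=0.9, seed=0):
    """Average the softmax likelihoods over `passes` stochastic forward passes."""
    rng = random.Random(seed)
    average = [0.0] * len(weights)
    for _ in range(passes):
        probs = softmax(stochastic_logits(features, weights, keep_prob, rng))
        average = [a + p / passes for a, p in zip(average, probs)]
    return average


features = [1.0, -0.5, 2.0]
weights = [[0.2, 0.1, 0.4], [0.5, -0.3, 0.1], [-0.2, 0.4, 0.3]]  # one row per word
predictive = mc_dropout_predict(features, weights, passes=200, keep_prob=0.9)
```

Setting `keep_prob` to 1.0 disables the masks, in which case the average collapses to a single deterministic softmax output.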
In information theory, the uncertainty underlying a predictive distribution can be measured by Shannon entropy [50]

$$H(y \mid s, D) = -\sum_{i} p(y = w_i \mid s, D) \log p(y = w_i \mid s, D).$$

Entropy is an encapsulation of information. It captures the average amount of information in the predictive distribution. It is maximized when the predictive distribution is a uniform distribution; in other words, the model demonstrates the maximum uncertainty when each word is equally likely. It is minimized when only one word has a probability of 1 and all other words have a probability of 0. An overview of Bayesian uncertainty quantification is illustrated in Fig. 5.
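The two extreme cases mentioned above can be checked numerically with a few lines of Python. The function name and the choice of the natural logarithm are assumptions of this sketch.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]     # every word equally likely
certain = [1.0, 0.0, 0.0, 0.0]         # one word has probability 1
peaked = [0.7, 0.1, 0.1, 0.1]          # an intermediate case

h_uniform = shannon_entropy(uniform)   # equals log(4), the maximum for 4 words
h_certain = shannon_entropy(certain)   # equals 0, the minimum
h_peaked = shannon_entropy(peaked)
```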

VI. EXPERIMENTS
We evaluate the proposed stego system with different settings of bound and threshold and compare different routings of interest. The trade-off between capacity, imperceptibility, and reversibility is examined with additional analysis on the semantic and sentimental similarities between cover and stego texts. Furthermore, a discussion on possible improvements is provided.

A. Experimental Setup
Our cover text consists of 8 selected paragraphs from a work of classic English literature, Alice's Adventures in Wonderland. The text contains 711 words plus punctuation. Each letter is made lower case. The BERT model has a vocabulary size of 30 522 tokens. The number of context words on each side of the target word (i.e. the length of the context segment) is set to 32 so that each input masked sub-sequence has 65 words. The number of Monte Carlo dropout samples (i.e. stochastic forward passes) is set to 1000. The predictive accuracy of the pre-trained BERT model can be represented by the probability distribution function (PDF) and cumulative distribution function (CDF) of the word index, as shown in Fig. 6. It suggests that more than 60% of words can be accurately predicted and about 90% of words are among the top 25 predictions in the case of no steganographic distortion.

B. System Evaluation
We evaluate our steganographic system with different settings of bound and θ and analyze the trade-off between capacity, imperceptibility, and reversibility. Capacity is measured by the number of payload bits (absolute value) and the payload bits per word (relative value). Imperceptibility is measured by the cosine similarity between the cover sequence and the stego sequence in the vector representation space. Reversibility is not an all-or-nothing proposition. We quantify reversibility by the number of flag bits used to disambiguate colliding word indices (the lower the better). Figs. 7 and 8 show, respectively, the capacity-imperceptibility curves and the capacity-reversibility curves from different routing methods. There is a general trend of decreasing imperceptibility and reversibility with increasing capacity. For the same threshold (θ = 1), a smaller bound preserves a better similarity because there are fewer non-carrier words to be shifted. On the other hand, a smaller bound incurs more flag bits because ambiguous indices, which are close to the bound, appear more frequently. When the threshold is raised to the maximum (θ = bound/3), the reachable capacity is increased. In addition, increased imperceptibility is obtained at the expense of lower reversibility. A comparison between parallel routing and sequential routing confirms the advantage of dynamic strategy over static strategy. The parallel routing takes account of predictive uncertainty and constructs an adaptive path for message embedding. The results also suggest that the random method has a positive effect on both the routing methods. Furthermore, when a few carrier words are filtered out by their uncertainty magnitudes (with a threshold τ ), the reachable capacity is reduced and yet greater imperceptibility and reversibility are achieved. A more advanced uncertainty analyzer is expected to further improve performance. 
We would also like to point out that the current uncertainty analyzer requires computationally expensive Monte Carlo dropout. To facilitate real-time applications, a more efficient way to estimate uncertainty has yet to be developed.
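The capacity and imperceptibility measures can be written down in a few lines. The sketch below assumes that the cover and stego sequences have already been mapped to fixed-length vectors by some encoder; the vectors shown are made-up values, and the payload of 213 bits is an illustrative figure chosen to be close to 0.3 bits per word on a 711-word text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bits_per_word(payload_bits, num_words):
    """Relative capacity: embedded payload bits divided by the number of words."""
    return payload_bits / num_words


cover_vector = [0.12, -0.40, 0.33, 0.80]    # hypothetical sequence embedding
stego_vector = [0.10, -0.38, 0.35, 0.79]    # hypothetical, slightly perturbed
similarity = cosine_similarity(cover_vector, stego_vector)
relative_capacity = bits_per_word(213, 711)
```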

C. Semantics Analysis
Figs. 9 and 10 display a part of the cover text and the corresponding stego text, along with the word clouds for visualizing the frequency distributions of cover words and stego words. The stego text is generated by the random parallel routing with θ = 1, bound = 270 and τ = 1. We set the capacity to 0.3 bits per word so that more than 200 payload bits are embedded. This number is arguably sufficient for many authentication applications. Perfect reversibility is also guaranteed without any overhead information. We can observe that the cover text and stego text are similar in terms of the semantics and the frequency distribution of words. However, closer inspection of the stego text shows that there are some unnatural word usages and grammatical mistakes. A possible refinement may be made by filtering out grammar words and named entities and manipulating content words only. Furthermore, a carefully designed word checker could be used to regulate the manipulations.
Fig. 9. Word cloud and sample paragraphs of cover text with highlighted carrier words (in blue) and non-carrier words (in green).

D. Sentiment Analysis
We carry out a sentiment analysis on a stego text generated using the aforementioned configurations. Fig. 11 reveals the positive/negative sentiment scores for each cover paragraph and each stego paragraph. The scores are obtained from a transformer-based sentiment analyzer. It is observed that the cover text and the stego text have very similar sentiment patterns, suggesting that steganographic distortion only produces minimal fluctuations in sentence sentiment. For particular sentiment-oriented applications, one may refine the system by retaining some salient contributory words which have a dominant influence upon text sentiment.

VII. CONCLUSION
In this work, we introduce a linguistic stego system with reversibility based on predictive word substitution. We use a pre-trained masked language model to generate a list of predictive words and embed a message digit by replacing the target word with one of the predictive words. The underlying assumption of the reversible coding is that word indices follow approximately an exponential distribution. We further apply a theoretical framework of Bayesian deep learning to quantify the uncertainty in the masked language model and use it to determine an adaptive route for message embedding. Our stego system achieves perfect reversibility without extra auxiliary information under limited capacity conditions. It also maintains close vector-space, semantic, and sentimental similarities between cover and stego texts. Imperceptibility analysis suggests that steganographic distortion is to some extent indiscernible in a computing sense. In reality, however, even an extra punctuation mark or unusual collocation may be noted by a careful reader. Therefore, further improvement in imperceptibility is required. We also envisage further progress in uncertainty analysis such that the computational efficiency meets real-time requirements. Furthermore, while the proposed stego system is based primarily on lexical substitution, syntactic and generative methods also deserve investigation. We hope this article can shed light on future research devoted to reversible linguistic steganography.