Discrete Prompt Compression With Reinforcement Learning

Compressed prompts aid instruction-tuned language models (LMs) in overcoming context window limitations and reducing computational costs. Existing methods, which are primarily based on training embeddings, face various challenges associated with interpretability, a fixed number of embedding tokens, reusability across different LMs, and inapplicability when interacting with black-box APIs. This study proposes prompt compression with reinforcement learning (PCRL), a discrete prompt compression method that addresses these issues. The proposed PCRL method utilizes a computationally efficient policy network that edits prompts directly. The training approach can be applied flexibly to various types of LMs, including both decoder-only and encoder-decoder architectures, and the policy can be trained without gradient access to the LMs or labeled data. The proposed PCRL achieves an average reduction of 24.6% in token count across various instruction prompts while maintaining sufficient performance. In addition, we demonstrate that the learned policy can be transferred to larger LMs, and through a comprehensive analysis, we explore token importance within prompts. The source code is available at https://github.com/nenomigami/PromptCompressor.


Introduction
Instruction-tuned language models (LMs) (Wei et al. 2021; Ouyang et al. 2022; Sanh et al. 2022), e.g., ChatGPT, are used increasingly to address various natural language processing (NLP) challenges, offering solutions through task-specific prompts for both individuals and businesses. The design of concise prompts that contain only essential information benefits both users and servers. For example, users benefit from reduced query-length-dependent API usage costs and from overcoming context window limitations, and servers benefit from shorter prompts that reduce the computational burden. Prompt compression methods that produce concise, information-rich prompts are therefore beneficial for efficient LM utilization.
A widely adopted prompt compression approach involves training embeddings that encapsulate the original contexts (Wingate, Shoeybi, and Sorensen 2022; Mu, Li, and Goodman 2023; Chevalier et al. 2023), using the soft prompt concept (Lester, Al-Rfou, and Constant 2021). However, with this approach, the appropriate embedding token count must be determined, its inherent properties can hinder interpretation, it lacks cross-model reusability, and its dependency on gradient access to LMs can make it impractical for scenarios that employ API services. An appealing alternative is compression via discrete prompts that comprise concrete tokens from the vocabulary. Only a few studies have investigated methods to compress discrete prompts. One such study, the selective-context method by Li (2023), reduces prompt length by filtering out less informative text based on self-information from an entropy perspective.
In this paper, we propose the prompt compression with reinforcement learning (PCRL) method, a discrete prompt compression technique that provides the advantages outlined in Table 1. Drawing on techniques similar to those used for extractive summarization, the learned policy edits prompts directly, removing tokens with limited contribution to the output of the LM (i.e., the generation LM). To reduce the computational overhead associated with the compression process, we design the policy so that the inclusion or exclusion of every token is determined simultaneously in a single step. In addition, the policy integrates MLP layers with a small number of parameters into a lightweight LM (i.e., the policy LM), which further improves computational efficiency.
The model is trained with a reward function that balances the faithfulness of the compressed prompts against their reduced length, using a policy gradient algorithm (Sutton et al. 1999). Here, faithfulness is evaluated indirectly by measuring the similarity between the outputs of the generation LM given the uncompressed and compressed prompts. This approach allows us to train the policy without the gradients of the LMs and ensures effective learning even in the absence of data labels. It also enables consistent training regardless of whether the generation LM has a decoder-only or encoder-decoder architecture.
The proposed model achieved an average compression ratio of 24.6% in experiments conducted on various instruction sets while maintaining output quality similar to that of the original prompts. In addition, we analyzed the importance of tokens for the response, and the results provide insights that could be used to further refine and optimize the compression technique. Furthermore, we found that a policy learned from a smaller model can be transferred to larger and more powerful generation LMs. Our model demonstrates transferability between various LMs by using discrete tokens rather than embeddings (Section 4.3).
The primary contributions of this study are summarized as follows:
• We propose the discrete prompt compression concept and formulate the problem using RL.
• We demonstrate the superior performance of the proposed PCRL method compared to existing methods and the transferability of the learned policy to more practical LMs.
• We explore the token characteristics within prompts that yield minimal contribution to the LM output.
2 Related Work

Discrete Prompt Optimization
Prompting has been widely used as a general method for NLP tasks (Brown et al. 2020; Schick and Schütze 2021; Sanh et al. 2022), and research into prompt optimization for LMs has emerged as a significant area of study. For example, prompt tuning optimizes continuous embeddings using gradient descent (Lester, Al-Rfou, and Constant 2021; Liu et al. 2021). In contrast, discrete prompt optimization searches for tokens or exemplars to construct effective prompts. Shin et al. (2020) utilized gradient information to search for the best-performing prompt, and Prasad et al. (2023) proposed an edit-based search method applicable to gradient-free scenarios. In addition, Zhou et al. (2022) leveraged LMs to generate and evaluate prompts. Deng et al. (2022) introduced an RL-based framework to generate optimal prompts and improve LM performance. Zhang et al. (2022) integrated various prompt components, including exemplars and the verbalizer, which were optimized using RL. These studies have made remarkable progress; however, they focus on enhancing performance and largely neglect the prompt compression perspective.

Prompt Compression
In the prompt compression research field, the majority of studies adopt the soft prompt concept. Early studies set distillation objectives to minimize the discrepancy between the generative distributions produced by LMs given the original prompts and those produced given the soft prompts (Wingate, Shoeybi, and Sorensen 2022). However, this technique requires re-optimization for each new prompt and therefore cannot generate compressed prompts for unseen prompts. Mu, Li, and Goodman (2023) decomposed a prompt into a task and an input, effectively reducing the task component to a few gist tokens; the proposed method differs in that it attempts to compress the entire prompt. Chevalier et al. (2023) focused on overcoming limited context windows using compressed summary vectors from long contexts. Similar to our work, Li (2023) removed less informative content from discrete prompts by calculating self-information based on the likelihood of tokens. However, this method depends on having access to probability information, which is unfeasible in black-box API scenarios.

Unsupervised Summarization
A different perspective of the proposed study involves unsupervised summarization to create more concise prompts. Specifically, we select extractive summarization over abstractive methods to reduce the search space and maintain closer context with the original prompt. Zhou and Rush (2019), for example, performed unsupervised summarization by leveraging pretrained LMs without labeled data.

3 Prompt Compression with RL

Task
Here, given a prompt p = {x_1, x_2, ..., x_n} comprising tokens x_i, a compressed prompt p′ is defined as a shorter sequence of tokens that, when input to the LM, produces a generative distribution P_LM(·|p′) similar to that obtained with the original prompt, P_LM(·|p). The output sequence of tokens is denoted y, and the function δ quantifies the divergence between the two distributions. The compressed prompt should satisfy the following condition:

$$|p'| < |p|, \qquad \delta\big(P_{LM}(\cdot \mid p),\; P_{LM}(\cdot \mid p')\big) \approx 0.$$
The primary objective of this study is to learn a policy π that compresses a given original prompt p as much as possible. When applied to a prompt p, this policy generates a shorter prompt p_π = π(p) that retains the semantic information of p. We cast this problem as a sequence labeling task that selects salient tokens from the prompt: an include/exclude label is assigned to each token x_i, creating a compressed prompt that contains only the required tokens. The optimization objective of the policy combines two terms, i.e., faithfulness and the compression ratio, using a balance term β.
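To make the sequence-labeling view concrete, the toy sketch below applies a vector of include/exclude labels to a prompt; whitespace splitting stands in for the policy LM's tokenizer, and the example labels are purely illustrative.

```python
# Toy illustration: each token x_i receives an include (1) / exclude (0) label,
# and the compressed prompt keeps only the included tokens.
prompt = "Instruction: Identify the odd one out. Input: Twitter, Instagram, Telegram"
tokens = prompt.split()                      # stand-in for the policy LM's tokenizer
labels = [1, 1, 0, 1, 1, 1, 1, 1, 1, 1]      # e.g., drop "the"
compressed = " ".join(t for t, keep in zip(tokens, labels) if keep)
print(compressed)
# Instruction: Identify odd one out. Input: Twitter, Instagram, Telegram
```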
Typically, common methods that use soft prompts fix the token length of the compressed prompt as a hyperparameter and minimize the divergence δ as a loss through gradient descent. However, challenges arise when practitioners interact with LMs via an API or when computing the gradients becomes excessively costly; in such cases, the probability distribution of output tokens P_LM(·|p) and the gradient information are frequently inaccessible. To overcome this challenge, we reformulate the problem using RL, which enables optimization without the LM gradient. In addition, we replace the measure of divergence between the output distributions P_LM(·|p) with a measure of similarity between the output sequences y = LM(p), and we adopt the ROUGE score to compute this similarity.

Training Procedure
The construction of the compressed prompts is formulated as a discrete prompt optimization problem, which we address using RL. To accomplish this, we set up the following Markov decision process (MDP). Given an initial state, i.e., a tokenized prompt p = {x_1, x_2, ..., x_n}, the policy π outputs binary labels as actions a = {a_1, a_2, ..., a_n} ∈ {0, 1}^n, one for each token. Here, each label a_i determines whether the corresponding token is included or excluded. Although this method may yield grammatically inconsistent prompts, a recent study suggests that such prompts can be more effective (Deng et al. 2022). Following the transition to a compressed prompt p_π, a reward R(p, a) is received. This reward is calculated from the output sequences of the LMs and the reduced prompt length. Note that the MDP terminates in a single step; thus, our environment resembles a contextual multi-armed bandit (Lu, Pál, and Pál 2010). In contrast to the traditional bandit problem, in which only a single action and its corresponding reward are available in each episode, our algorithm allows the policy to obtain rewards for multiple possible actions.

As illustrated in Figure 1, a prompt is sampled from the prompt pool, which is a dataset of prompts that do not require labels. The sampled prompt is processed by the compression policy π_θ to produce a compressed prompt p_π. The original and compressed prompts are then input to the generation LM, yielding two output responses, and the reward is calculated based on the measured similarity and the compression ratio of p_π. To balance accuracy and time efficiency during the generation process, we limit the number of generated tokens to T. Note that a longer and more time-consuming generation process could offer a more accurate estimate of the similarity; however, empirical findings indicate that even a partial generation is sufficient.

The compression policy π_θ (parameterized by θ) is trained using the policy gradient algorithm. Given an input prompt p, the policy yields a probability distribution over binary actions a_i for each token. The objective is to identify the parameters θ for which π_θ(a|p) assigns a high preservation probability to tokens that convey the essence of the prompt, which is accomplished by maximizing the following objective function with respect to θ:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\big[R(p, a)\big],$$

where π_θ stands for π_θ(a|p). The policy gradient algorithm possesses the following gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\big(R(p, a) - R(p, \hat{a})\big)\, \nabla_\theta \log \pi_\theta(a \mid p)\big].$$

Note that we subtract a baseline from the reward to facilitate effective learning by adopting self-critical sequence training (SCST) (Rennie et al. 2017). Here, R(p, a) is the reward of the action a sampled from the current policy π_θ(·|p), whereas the baseline R(p, â) is derived by executing the action â with the highest probability under the current policy.
In simpler terms, an action is considered preferable if it offers a reward greater than that obtained by the current policy's greedy action. Both the ROUGE score and the compression ratio, which are used in the reward function, are positive; thus, it is necessary to penalize actions that yield relatively lower rewards, and incorporating the baseline addresses this concern effectively.
A limitation of SCST occurs when the two sequences achieve comparable rewards, R(p, a) ≃ R(p, â): the loss approaches zero, and the model has little to learn, thereby wasting a training sample (Laban et al. 2021). Thus, we enhance the learning process by sampling k actions from the current policy for the same prompt and averaging over the resulting rewards. In addition, to reduce instances where the loss is near zero, we add an entropy term to the loss, which increases the probability of sampling diverse actions (Haarnoja et al. 2017). We then train the model by minimizing the following loss function:

$$\mathcal{L}(\theta) = -\frac{1}{k}\sum_{j=1}^{k}\big(R(p, a^{(j)}) - R(p, \hat{a})\big)\log \pi_\theta(a^{(j)} \mid p) \;-\; \alpha\, \mathcal{H}\big(\pi_\theta(\cdot \mid p)\big),$$

where a^{(j)} denotes the j-th sampled action and H is the entropy of the policy.
Here, the temperature parameter α determines the significance of the entropy term.
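A minimal PyTorch sketch of this loss is given below. It assumes `keep_probs` holds the policy's per-token preservation probabilities for a single prompt and that `reward_fn` returns R(p, a) for a binary action vector; the function name and default values are illustrative assumptions rather than the released implementation.

```python
import torch
from torch.distributions import Bernoulli

def pcrl_loss(keep_probs: torch.Tensor, reward_fn, k: int = 4, alpha: float = 0.01) -> torch.Tensor:
    """SCST-style loss: k sampled actions, greedy baseline, entropy bonus."""
    dist = Bernoulli(probs=keep_probs)             # independent include/exclude per token

    greedy = (keep_probs > 0.5).float()            # action a_hat with the highest probability
    baseline = reward_fn(greedy)                   # R(p, a_hat)

    losses = []
    for _ in range(k):
        actions = dist.sample()                    # a^(j) ~ pi_theta(. | p)
        advantage = reward_fn(actions) - baseline  # R(p, a^(j)) - R(p, a_hat)
        log_prob = dist.log_prob(actions).sum()    # log pi_theta(a^(j) | p)
        losses.append(-advantage * log_prob)

    entropy = dist.entropy().sum()                 # encourages diverse actions
    return torch.stack(losses).mean() - alpha * entropy
```

Because the reward itself is non-differentiable, gradients flow only through the log-probability and entropy terms, which is exactly what the policy gradient update requires.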

Model Architecture
Fig. 2 shows the architecture of the policy network π_θ. Here, we attach binary classification MLP layers to a frozen pretrained policy LM, which is used to extract contextual embeddings of the tokens. A primary motivation behind compressing prompts is the need to reduce computational costs, leading us to favor efficient, smaller-scale backbone models, e.g., DistilBERT (Sanh et al. 2019). During training, only the parameters of the attached MLP layers are updated. We use action masks to prevent the policy from excluding statement tokens (e.g., "Instruction: " and "Input: "), ensuring that the compression ratio reflects the actual reduction of the prompt rather than the removal of these statement tokens. In addition, the policy LM does not necessarily have to be the same as the generation LM for which we optimize the prompt.
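The sketch below illustrates such a policy network under stated assumptions: the backbone name, hidden size, and masking details are examples for illustration, not the exact released architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class CompressionPolicy(nn.Module):
    """Frozen policy LM backbone + small trainable MLP head with per-token keep probabilities."""
    def __init__(self, backbone: str = "distilbert-base-uncased", hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        for p in self.encoder.parameters():        # the backbone stays frozen
            p.requires_grad = False
        dim = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask, statement_mask):
        # statement_mask: 1 for tokens (e.g., "Instruction:") that must never be excluded.
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        keep_probs = torch.sigmoid(self.head(h).squeeze(-1))          # (batch, seq_len)
        # Mask out the "exclude" action for statement tokens.
        keep_probs = torch.where(statement_mask.bool(), torch.ones_like(keep_probs), keep_probs)
        return keep_probs
```

Only the MLP head receives gradients, which keeps the number of trainable parameters small.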

Reward Design
Note that the reward function must balance two potentially conflicting terms, i.e., faithfulness and reduced length. To account for faithfulness, we define a term based on the ROUGE-L score of the two output token sequences generated from the original prompt p and the compressed prompt p_π.
The ROUGE-L score considers sentence-level structural similarity; thus, it is suitable as a faithfulness term. To reflect the reduced length, we use the compression ratio, which is the proportion of the reduced token count to the original token count in the prompt. The final reward is given as follows:

$$R(p, a) = \begin{cases} \mathrm{CR}(p, p_\pi) & \text{if } \mathrm{ROUGE\text{-}L}(y, y_\pi) \ge \tau, \\ -\lambda & \text{otherwise}, \end{cases}$$

where y and y_π denote the outputs generated from p and p_π, respectively, and CR is the compression ratio.
If the ROUGE-L score exceeds a certain threshold τ, the model receives the compression ratio as the reward; otherwise, the model receives the penalty −λ.
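For concreteness, a minimal sketch of this reward is shown below, using the Hugging Face `evaluate` package (with its `rouge_score` backend); the threshold τ and penalty λ values are placeholders, not the settings used in the paper.

```python
from evaluate import load   # assumes the `evaluate` and `rouge_score` packages are installed

rouge = load("rouge")

def reward(y_original: str, y_compressed: str,
           n_tokens_original: int, n_tokens_compressed: int,
           tau: float = 0.9, lam: float = 1.0) -> float:
    """Compression ratio if the outputs stay similar enough, penalty otherwise."""
    rouge_l = rouge.compute(predictions=[y_compressed], references=[y_original])["rougeL"]
    compression_ratio = 1.0 - n_tokens_compressed / n_tokens_original   # reduced / original
    return compression_ratio if rouge_l >= tau else -lam
```

In practice, the outputs y and y_π would be truncated to T generated tokens, as described in the training procedure, and statement tokens would be excluded from the token counts.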
A key difference between the proposed method and typical RL-based summarization (Laban et al. 2020; Ghalandari, Hokamp, and Ifrim 2022) is that we do not consider grammatical correctness. Recent studies (Webson and Pavlick 2022; Prasad et al. 2023; Deng et al. 2022) have suggested that LMs leveraging prompts do not necessarily adhere to human language patterns; interestingly, prompts that yield high performance tend to be gibberish without a clear human-understandable meaning. Thus, we do not incorporate grammatical fluency into the reward function. In fact, this facilitates the acquisition of shorter prompts.

Experiments
Through a series of experiments, we demonstrate that the proposed PCRL method compresses prompts successfully regardless of the type of generation LM. In these experiments, we fine-tuned the LMs on a diverse set of instruction data to mimic off-the-shelf instruction-tuned LMs. We then evaluated the performance of the compressed prompts obtained by the PCRL method on a validation instruction set. In addition, the experimental results demonstrate that the transferability of the compression policy across LMs allows us to learn from smaller models in a cost-effective manner and apply the policy to larger, more powerful models.

Instruction Prompts
Datasets To construct LMs that generalize across various instructions, we used the Alpaca+ dataset, following a previous study (Mu, Li, and Goodman 2023). The Alpaca+ dataset consists of a Self-Instruct (Wang et al. 2022a) and a Stanford Alpaca (Taori et al. 2023) dataset. Specifically, it comprises (task, input, answer) tuples, with a total of 104,664 unique tasks, and it is effective for experiments involving a diverse set of instructions. The validation set in the Alpaca+ dataset is categorized into three distinct sets. The first set, Seen prompts, contains 1,000 prompts in which the tasks are already seen in the training set; however, the inputs are new. The second set, Unseen prompts, includes 1,000 prompts in which both the tasks and the inputs have never been encountered in the training set. The final set comprises 252 handcrafted human prompts, representing a substantial out-of-distribution (OOD) challenge.

Models
In these experiments, we employed two different architectures to demonstrate that the proposed method can be applied to various text generation LMs. The first LM is GPT2-XL (Radford et al. 2019), a decoder-only model, and the second is FLAN-T5-XL (Chung et al. 2022), an encoder-decoder model. These LMs have 1.5B and 3.0B parameters, respectively. Each model was fine-tuned on the Alpaca+ dataset, for three epochs for GPT2-XL and one epoch for FLAN-T5-XL, to create instruction-tuned models for inference. The performance achieved with uncompressed prompts, used as the upper-bound baseline (Original), is the standard against which our models are evaluated. Several approaches were considered for comparison, including the basic technique of eliminating less informative tokens (specifically stop words) using the NLTK stop word list (Bird, Klein, and Loper 2009). In addition, we compared the proposed model's effectiveness with that of the selective-context method (Li 2023). To ensure a fair comparison, we configured that method to perform compression at the token level with similar compression ratios and to retain the statement tokens. We evaluated both the foundation model and the instruction-tuned model to calculate self-information, and we report the best obtained results.
Evaluation The evaluation metrics used to assess model performance were ROUGE-L, the ChatGPT metric, and the compression ratio. ROUGE-L has been used in instruction fine-tuning (Wei et al. 2021; Wang et al. 2022b) and prompt compression (Mu, Li, and Goodman 2023) studies. ROUGE-L calculates the similarity between the ground truth (Gt) and the generated response (Gen) by measuring the F1 score of the longest common subsequence (LCS). This similarity is quantified using the following formulas:

$$R_{lcs} = \frac{\mathrm{LCS}(Gt, Gen)}{|Gt|}, \qquad P_{lcs} = \frac{\mathrm{LCS}(Gt, Gen)}{|Gen|}, \qquad F_{lcs} = \frac{2\, R_{lcs} P_{lcs}}{R_{lcs} + P_{lcs}}.$$

It is important to distinguish this usage from that in the reward calculation. The reward function employs ROUGE-L to compare the sentences generated from the original and compressed prompts, whereas during evaluation it measures the similarity to the true reference in the dataset. GPT2-XL tends to continue generating tokens until it reaches the maximum token limit; thus, for both models, we generate tokens up to the number of tokens in the reference sentence. The compression ratio (Cr) is the reduced token count in the compressed prompt divided by the token count in the original prompt.
To ensure fairness, we calculate Cr by excluding the statement tokens from the counts. This ratio signifies the model's effectiveness in condensing the original prompt. Due to potential differences between the tokenizers used by the policy and the generation LMs, we employ the decoded text as a bridge: tokens are edited on the basis of the policy LM's tokenizer, and Cr is calculated using the generation LM's tokenizer (a sketch of this computation is given below).

The ChatGPT metric represents the ratio by which ChatGPT selects the better response between two options for a given task. Here, the objects of comparison are the responses to our model's compressed prompt and to the original prompt. The ChatGPT metric serves as a supplement because it considers more semantic elements than the ROUGE-L metric. If the compressed prompt is similar in meaning to the original prompt, a result approximating 50% is expected. This metric is considerably faster and more cost-effective than human evaluation while exhibiting nearly the same performance as human annotators in instruction-following tasks (Mu, Li, and Goodman 2023). In addition, the near-human performance of ChatGPT in text annotation and evaluation (Gilardi, Alizadeh, and Kubli 2023; Huang, Kwak, and An 2023; Wang et al. 2023) lends credibility to this measure. The prompt given to ChatGPT follows precisely that described in the literature (Mu, Li, and Goodman 2023) without any additional prompt engineering.
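The sketch below illustrates this tokenizer bridge under assumed model names; in the actual computation, statement tokens would additionally be excluded from the counts.

```python
from transformers import AutoTokenizer

policy_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")   # policy LM tokenizer (example)
gen_tok = AutoTokenizer.from_pretrained("gpt2")                         # generation LM tokenizer (example)

def compress_and_measure(prompt: str, actions: list[int]) -> tuple[str, float]:
    """Edit tokens with the policy tokenizer, then measure Cr with the generation tokenizer."""
    ids = policy_tok(prompt, add_special_tokens=False)["input_ids"]
    kept = [tok for tok, keep in zip(ids, actions) if keep]
    compressed = policy_tok.decode(kept)          # decoded text acts as the bridge

    n_orig = len(gen_tok(prompt, add_special_tokens=False)["input_ids"])
    n_comp = len(gen_tok(compressed, add_special_tokens=False)["input_ids"])
    cr = (n_orig - n_comp) / n_orig               # reduced token count / original token count
    return compressed, cr
```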

Results
The experimental results for the instruction-following tasks on the entire validation set are shown in Table 2. As can be seen, the proposed model outperformed the compared methods on all validation sets. For the GPT2-XL model, our compression policy achieved performance similar to that of the original prompts in terms of ROUGE-L scores and the ChatGPT metric across most validation sets, while reducing the number of input tokens by an average of 22.7% for GPT2-XL and 26.4% for FLAN-T5-XL. In the human split, both the ROUGE-L scores and the ChatGPT metric exhibited lower overall values; in this split, the OOD challenge appears to make it difficult for the policy to compress while considering the context.

Transferring Prompts across LMs
A unique advantage of discrete prompts over soft prompts is that they are transferable across models because they lie in the common text space rather than a model-specific latent space (Deng et al. 2022). Leveraging this advantage, we demonstrate the practicality of the proposed model by applying it to larger, more powerful generation LMs. The results of this experiment show that the proposed method's use of discrete prompts enables higher flexibility and robustness, making it a valuable tool in various scenarios and across different models.
Experiment We evaluated the transferability of the proposed method using 2,252 data points, i.e., the combined size of all validation sets used in the previous experiment.
Here, we considered four models: LLaMa2 (Touvron et al. 2023), a decoder-only model with 7B parameters; Falcon (Almazrouei et al. 2023), another decoder-only model with 7B parameters; FLAN-T5-XXL (Chung et al. 2022), an encoder-decoder model with 11B parameters; and GPT-3.5, the LM used in ChatGPT. Specifically, we used the Llama-2-7B-chat-hf, Falcon-7B-instruct, FLAN-T5-XXL, and gpt-3.5-turbo models without fine-tuning. In line with our previous experiments, we compared the outputs of the original and compressed prompts using the ChatGPT metric. This allowed us to effectively assess how well the proposed method performs across different models, showcasing its flexibility and potential for adaptation to various scenarios.
Results Table 3 shows the transfer results for compression policies applied to various large LMs. These policies were trained using GPT2-XL and FLAN-T5-XL as the generation LMs. As can be seen, the difference in the compression ratio due to variations in the tokenizers of the generation LMs was minimal; as a result, the Cr value was similar to that obtained in the previous experiments. Surprisingly, we found that the ChatGPT evaluation is generally consistent with the results of the original generation models and, in some cases, even surpasses them. Specifically, LLaMa2 demonstrated a successful transfer with a win rate of 47.3% for the policy trained with GPT2-XL and 45.8% for the policy trained with FLAN-T5-XL. In addition, the performance of GPT-3.5 surpassed the results obtained with the models used in training, achieving 49.8% for the GPT2-XL policy and 47.7% for the FLAN-T5-XL policy.
This level of stability emphasizes the viability of the proposed method, indicating its effectiveness even with updates to the API version or when an entirely different LM is used.
The results from LLaMa2 and GPT-3.5 suggest that the more powerful the model, the less susceptible it is to the influence of redundant tokens, indicating a higher potential for compression. In addition, the performance of the FLAN-T5-XXL model lagged behind the other models, despite it employing the same training procedure and tokenizer as the FLAN-T5-XL model. This variation may stem from fine-tuning differences on the Alpaca+ dataset, causing a deviation from the performance observed with the original FLAN-T5-XL model.

Analysis
We applied the proposed model to the Alpaca+ training set, which comprises a total of 4.47M tokens, to identify patterns among the excluded tokens. This analysis focused on the 1,000 most frequent tokens out of a total of 25,670 different tokens in the dataset. Table 4 shows the 20 tokens with the highest removal ratio (Removal Ratio) together with their rank in terms of appearance frequency (Freq Rank). Here, the Removal Ratio value was calculated by dividing the number of times a token was removed by the number of times it appeared. The tokenization was performed by the same tokenizer used by the compression policy. When analyzing the edited prompts, we found that the eliminated tokens primarily belong to three main groups, i.e., stop words, punctuation, and endings. Table 4 includes several stop words, e.g., the articles 'a' and 'the' and certain prepositions. Aligning with common sense, the indefinite article 'a' has a much higher removal ratio than the definite article 'the', which refers to specific things. In addition, punctuation marks (',' and '.') were deleted frequently. Endings, e.g., 'ify' in 'Identify' and 'ribe' in 'Describe', were removed at high ratios.
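A sketch of how this removal-ratio analysis could be computed is shown below; `pairs` is a hypothetical list of (tokens, kept-mask) records produced by running the learned policy over the training set.

```python
from collections import Counter

def removal_ratios(pairs, top_k: int = 1000):
    """Removal Ratio = (# times removed) / (# times appeared), over the top_k most frequent tokens."""
    appeared, removed = Counter(), Counter()
    for tokens, kept_mask in pairs:
        for tok, kept in zip(tokens, kept_mask):
            appeared[tok] += 1
            if not kept:
                removed[tok] += 1
    frequent = [tok for tok, _ in appeared.most_common(top_k)]
    ratios = {tok: removed[tok] / appeared[tok] for tok in frequent}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)   # highest removal ratio first
```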
The following examples show actual compressed prompts, where the content inside parentheses has been removed by the compression policy. Despite these removals, the edited prompts remain interpretable. In the first example, most of the removed tokens belong to stop words, punctuation, and endings.

''' Instruction: Ident(ify) (the) odd one (out)(.) Input: Twitter(,) Instagram(,) Telegram Output: '''

Even beyond the categories mentioned above, other words may be removed if the sentence still retains its meaning; however, elements in the input are removed infrequently, as in the following excerpt:

''' ... opened the door to find (a) tall figure cloaked in shadows(.) Output: '''

This is likely because many tasks have results that change even with slight variations in the input. Additional tables and examples are given in Appendix Section C.

Conclusion
This paper has proposed the PCRL method, a prompt compression technique that utilizes RL. By reducing the number of tokens in the input prompts sent to LMs, we overcome limitations related to the context window and reduce both inference time and API usage costs. The proposed model is trained using only a generation LM, without the need for labeled data, and it requires only a small number of MLP layers on top of a frozen LM, making it parameter efficient. Despite being trained with a smaller model, we have demonstrated the potential for transferring the learned policy to larger, more practical models. In addition, through further analysis, we have provided a deeper understanding of the contribution of individual tokens in the prompts input to the LM.

Limitations
To reduce inference costs while training the proposed PCRL, we fine-tuned LMs (i.e., the GPT2-XL and FLAN-T5-XL models) on instruction data and used them as the generation LMs. If off-the-shelf models that achieve instruction-following performance without fine-tuning had been used, a more practical compression policy and more convincing results might have been obtained.
A limitation of the proposed method lies in its use of extractive compression. Considering prompt meanings and sentence paraphrasing is expected to further reduce the number of tokens, and exploring this direction will be the focus of future work.
Additionally, our method carries a potential risk associated with editing the original prompts. Specifically, in cases where the original sentence must be referenced directly for rewriting, erroneous outputs may occur, and if the compressed prompt omits crucial information, it may trigger hallucinations. Moreover, the LM used in policy training also has a limited context length, which may restrict its use in compressing longer prompts.
Another limitation relates to the reward design, where the use of the ROUGE score as the faithfulness term has certain constraints. If the feasible responses in the probability space of the LM's output do not share similar words, a well-executed response may not receive a high reward. For example, if the task involves inventing a new game, and the compressed prompt yields a variation of hopscotch while the original prompt yields a card game, both responses are well-executed; however, the faithfulness term would be close to zero. In the future, this limitation may be addressed by implementing a reward design that considers semantics, e.g., a human preference function.

Figure 1: Overall training procedure of PCRL. A prompt is sampled from the prompt pool, edited by the compression policy, and evaluated by comparing the generation LM's responses to the original and edited prompts. The resulting reward is used for policy updates.

Figure 2: The policy network of PCRL. When a tokenized prompt is input, the network outputs an include/exclude probability for each token. If a token is part of a statement, the exclude action is masked out.

Table 1: Comparison of the proposed model with soft prompt compression methods based on selected desirable properties. Generalization represents the characteristic that allows a method to handle new prompts without requiring retraining. A model capable of adaptive compression adjusts the length of the compressed prompt according to the length of the original prompt. Black-box applicable methods can be applied in black-box API scenarios where gradients or token probabilities are not provided.

Table 2: ROUGE-L and ChatGPT performance of PCRL for instruction prompts. Values in parentheses indicate scores normalized to the Original.

Table 3: Transferability of the proposed PCRL method across different LMs, evaluated using the ChatGPT metric. Values in parentheses indicate the 95% confidence interval.