Intelligent Practices of Large Language Models in Digital Government Services

Large language models have been widely applied to open-domain tasks with strong results, and they can answer closed-book questions in a zero-shot setting by drawing on the knowledge stored in their parameters during pre-training. However, this internalized knowledge may be insufficient or outdated for government service consultation, so the models may fail to give accurate, rigorous answers or effective assistance. The issue has attracted widespread attention, but datasets for relevant research are lacking. Taking Beijing as an example, this paper collects common service consultation questions and the corresponding official answers from the municipal government website as a dataset. The dataset covers the everyday consultation questions that citizens encounter, including common questions about medical insurance, social insurance, the provident fund, flexible employment, and government-citizen interaction. Based on this scenario, the paper designs a domain-specific language model (GCALLM) for government service consultation. Domain knowledge is injected by fine-tuning a large language model, and the fine-tuned model then supplies contextual information that helps the large language model perform better in government service consultation. This addresses the problem of imprecise answers and makes responses more rigorous and accurate. In addition, responses are provided in seven major languages to support the construction of digital government consulting services. Extensive experiments show that the model produces accurate responses in the government consultation scenario.


I. INTRODUCTION
GPT [1], Google's PaLM [2], and other benchmark models [3], [4], [5], [6], [7] are prominent in the field of artificial intelligence. Pre-training on large amounts of data gives these models exceptional language understanding and generation capabilities [8]. On domain-specific problems, however, large language models exhibit limited performance: their pre-training covers domain knowledge insufficiently, and the overwhelming presence of general-domain data causes them to prioritize public knowledge, so critical domain-specific information may be overlooked [8], [9], [10]. Training a large language model from scratch is prohibitively expensive for most companies and researchers. Fine-tuning an existing model on domain-specific data is an alternative, and several studies have shown efficient strategies for doing so. Among them, P-tuning v2 [11] rivals full fine-tuning across model scales on question-answering tasks while reducing the number of parameters that need to be tuned to 0.1% of the original. Combined with methods such as model quantization and gradient checkpointing, it can improve training efficiency while reducing the memory and computational resources the model consumes. These methods make the model more lightweight by optimizing and compressing it, and they preserve performance well: resource consumption falls without a significant effect on model performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
With the rise of large language models and the effective application of various fine-tuning techniques, government consulting services also face new development opportunities. Consulting is an important part of government office work, and the strong dialogue ability of large language models can accelerate the development of intelligent consulting services. At present, consulting service robots in China are not intelligent enough: they can only output preset answers whose text matches the query and cannot give personalized answers to users' questions. Large language models, together with the text-to-text translation (T2TT) function of SeamlessM4T [12], can meet the demand for intelligent consultation and help the public access, understand, and master government processes more easily, avoiding problems such as having nowhere to handle affairs, no one to consult, and ineffective communication. The service can thus be provided around the clock. With the development of smart government, more and more government departments will gradually use large language models to improve the quality of consulting services. However, because large language models currently hold relatively little knowledge of government consulting, they may be unable to give complete and accurate answers to questions in this domain. Therefore, this paper proposes a scenario-specific domain model that provides accurate policy information services, addressing the long-standing problems of services that are hard to use and that are integrated without being intelligent. This is expected to have a positive impact on socio-economic development.
Existing work, on the one hand, extracts domain-specific knowledge through retrieval-based approaches [13], [14] or external modules [15], [16] to enhance the performance of large language models in specific domains. On the other hand, text embeddings are employed to retrieve potentially relevant text passages from a corpus of domain-specific documents. The text is transformed into real-valued vectors that encode its meaning. To identify relevant text, the query is vectorized and the cosine similarity between the query vector and each pre-computed text block vector stored in the vector repository is calculated. A small number of the most relevant blocks are then appended to the user query to construct an effective prompt, which is sent to the large language model to produce a response [17]. However, these methods have limitations. To address them, this paper proposes to fine-tune a large language model, ChatGLM [6], with P-tuning v2 [11] on the domain documents of a government consulting service and to use the fine-tuned model to construct prompts. The fine-tuned model provides knowledge specific to government service consulting at runtime, which also makes the system easier to maintain and helps protect privacy within this domain.
The contributions of this article are as follows. First, common consultation Q&A information from government services in Beijing is collected as a dataset and used as the data source for fine-tuning large language models. Second, a model fine-tuned with P-tuning v2 [11] to inject domain knowledge supplies contextual information to a large language model, and the two are combined to generate answers in the domain of government service consultation; extensive experiments show that this method outperforms the text-embedding-based method in this domain. Third, to better build the consulting service of digital government, the responses are delivered in seven major world languages, including Chinese, French, Spanish, Portuguese, German, and Japanese. Together, these provide new ideas and methods for digital government consultation services, helping digital governments adapt to the information age and provide more innovative and valuable services.

II. RELATED WORK
The training of large language models is usually divided into two phases: pre-training and fine-tuning. In the pre-training phase, the model is trained on a large amount of unlabeled text to learn the basic laws and patterns of language, and thus acquires basic language comprehension and generation capabilities. In the fine-tuning phase, the model's broad language comprehension is adapted to specific tasks so that it responds better to domain-specific nuances, which improves its ability to generalize to unseen tasks [18], [19]. However, domain-specific tasks often involve complex concepts, technical terminology, and intricate relationships between entities [20]. Without targeted guidance, large language models can hallucinate severely.
Recently, many large language models have emerged in the medical [21], [22], [23], financial [24], [25], and legal [26], [27] domains. Efforts have been made to improve the context-generation capabilities of large language models in specific domains by incorporating external knowledge [28]. Some approaches use external modules, such as TaskMatrix [7] and AutoGPT [16]. These approaches depend heavily on the prompt management of the large language model and on the availability of external tools, and for domain-specific scenarios the external modules are not always effective. Alternatively, chatbots [17] make it possible to use large language models in specialized domains without updating their parameters. Domain knowledge can also be injected by updating parameters through fine-tuning, and fine-tuning techniques have advanced notably [11], [29], [30], [31], [32]. Among them, P-tuning v2 [11] is an optimized and adapted implementation of Deep Prompt Tuning [17], [33]. One of its key improvements is to apply continuous prompts to every layer of the pre-trained model, not just the input layer. By increasing the capacity of the continuous prompts in this way, P-tuning v2 [11] achieves performance comparable to full fine-tuning across various settings while tuning only 0.1% to 3% of the parameters. This paper mainly uses P-tuning v2 in the field of government service consulting.
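To make the small share of tuned parameters concrete, the following sketch counts the trainable values when a continuous prompt is attached to every layer. It is a back-of-the-envelope illustration only: the layer count, prefix length, hidden size, and backbone size are hypothetical values that do not come from this paper, and the count assumes one key vector and one value vector per prefix position per layer, with no reparameterization network.

```python
def prefix_param_count(num_layers: int, prefix_len: int, hidden_size: int) -> int:
    """Trainable values when a continuous prompt is added to every layer.

    Assumes each layer receives prefix_len key vectors and prefix_len value
    vectors of width hidden_size; the backbone weights stay frozen.
    """
    return num_layers * prefix_len * 2 * hidden_size


# Hypothetical configuration, chosen only for illustration.
NUM_LAYERS = 28
PREFIX_LEN = 128
HIDDEN_SIZE = 4096
BACKBONE_PARAMS = 6_200_000_000  # assumed size of the frozen backbone

trainable = prefix_param_count(NUM_LAYERS, PREFIX_LEN, HIDDEN_SIZE)
fraction = trainable / BACKBONE_PARAMS

print(f"trainable prefix parameters: {trainable:,}")
print(f"share of backbone parameters: {fraction:.2%}")
```

Under these assumed values the share falls inside the 0.1% to 3% range quoted above; a longer prefix or a smaller backbone would move it toward the upper end.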
Training methods for domain-specific large language models can be broadly divided by training stage into three categories. The first trains from scratch on domain data, as in BloombergGPT [24]; it relies on a large amount of domain data, and the training cost is high. The second fine-tunes on domain instruction data, as in ChatLaw [26]. The third continues training a base large language model on domain data and then fine-tunes it with instructions [27].

III. METHOD
A. LM PROMPTING
Language models are pre-trained by predicting the next token from the previous tokens, which is known as autoregressive language modeling [34], [35]; training in this way enables zero-shot instruction learning. Specifically, a question and an instruction are provided to the language model, and the model generates an answer based on the input text. More precisely, each input x is first converted into a text string X, called a prompt, using a template T for a particular instruction, i.e., T : x → X. For example, given the question x = "The term of the collective contract?" and the instruction template T = "Please answer this question: {x}", the resulting prompt is T(x) = "Please answer this question: The term of the collective contract?". The prompt is then forwarded to the large language model, which generates the answer. This approach has a key weakness: the large language model relies too heavily on the knowledge in its parameters, and its lack of knowledge in the government domain tends to produce factually incorrect information.
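The mapping T : x → X amounts to string templating. The sketch below reproduces the example from the text; the function and constant names are illustrative and are not taken from the paper's implementation.

```python
INSTRUCTION_TEMPLATE = "Please answer this question: {x}"


def build_prompt(question: str, template: str = INSTRUCTION_TEMPLATE) -> str:
    """Apply the instruction template T to a question x, giving the prompt X."""
    return template.format(x=question)


prompt = build_prompt("The term of the collective contract?")
print(prompt)
```

The resulting string is what would be sent to the large language model; nothing in it carries domain knowledge, which is the weakness the next subsection addresses.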

B. GCALLM
To address the limitations of the current prompting scheme, this paper combines the relevant domain knowledge held in a fine-tuned model with the question and forwards both to a large language model, which generates the answer. To exploit the strengths of the large language model while constraining its answer generation so that answers are sufficiently complete and accurate, this paper pairs a fine-tuned large language model with another large language model. The model interaction (shown in FIGURE 1) involves two key steps: obtaining a domain-specific model by fine-tuning a large language model, and providing the domain-specific knowledge it generates to the large language model.
In the first step, this paper uses the knowledge captured in the government service consultation documents to form Q&A pairs for domain fine-tuning of the large language model. The documents serve as a knowledge base that contains a variety of common consultation questions. Using this knowledge base, domain-specific knowledge is infused into the large language model through fine-tuning with P-tuning v2 [11]. To facilitate fine-tuning, prompts are constructed from the collected dataset of common government service inquiries, and each prompt consists of a question and an answer. For example:
{"content": "Can 40-year-old female comrades handle flexible employment?", "summary": "Can handle."}
{"content": "Flexible Employee Social Security Contributions for the month of the increase in what time to deduct the fee?", "summary": "The next monthly deduction for the procedure of adding members in the same month."}
In the second step, the fine-tuned model, which has been injected with domain knowledge, provides domain-specific Q&A knowledge D to the large language model when the user queries the fine-tuned model.
Specifically, each input is first converted into a text string X′, called a prompt, using an instruction-specific template T′, i.e., T′ : (x, D) → X′. For example, suppose the question is x = "Duration of the collective contract?", the message from the fine-tuned model is D = "Collective contracts generally have a duration of 3 years", and the instruction template is T′ = "Known information: {D}. Based on the above known information, use the known information to produce an output that cannot be modified or added to. The output should be in Chinese. The question is: {x}". The resulting prompt is then X′ = "Known information: Collective contracts generally have a duration of 3 years. Based on the above known information, use the known information to produce an output that cannot be modified or added to. The output should be in Chinese. The question is: Duration of the collective contract?". The prompt is forwarded to the large language model, which generates the answer. This knowledge enrichment improves the large language model's understanding of the task context, enabling it to generate more accurate and contextually relevant responses. The response is then passed to the translation module, a large-scale multilingual T2TT model (SeamlessM4T-NLLB [12]), which can understand

IV. EXPERIMENTS
A. DATASET
To address the lack of a dataset for government consulting scenarios, this paper takes Beijing as an example and collects common government service consultation questions and the corresponding official standard answers as a dataset, which covers frequently asked questions on medical insurance, convenience services, and other topics.

1) DATA COLLECTION
Medical insurance, social insurance, housing provident fund, and flexible employment are common and popular keywords in government consulting services. The corresponding questions are common consultation questions from citizens, along with some requests for help and suggestions. Citizens can look up relevant questions or submit suggestions on the official website of Beijing, and the relevant departments respond promptly. Below are three brief examples of Q&A responses.
Q: How long do you need to be in flexible employment to be eligible for a flexible employment subsidy again?
A: Complete 90 days.
Q: Can you switch to a Housing Provident Fund (HPF) personal housing loan if you have already taken out a commercial loan?
A: Currently, you cannot transfer a commercial loan to a Housing Provident Fund (HPF) personal housing loan in Beijing.
Q: Can insured persons use their personal accounts at designated retail pharmacies?
A: It can be used by insured persons to pay for their personal expenses on drugs, medical devices, and medical consumables at designated retail pharmacies.
This paper focuses on questions with short word counts and extracts the corresponding answers from the official departmental responses.

2) DATA FILTERING AND POST-PROCESSING
As the data comes from online websites, the content is complex and contains a large amount of irrelevant material, which makes it difficult to use for research. To address this problem, this paper samples the collected data in depth, summarizes the problems, identifies the patterns, and designs the following rules for data filtering.
(1) Delete the name of the matter and the implementing body, which indicate the topic of the question, as well as the official unit that answers the question.
(2) Normalize all links in the data.
(3) Remove answer content that is irrelevant to the question, such as: "Thank you for your interest in the transit industry and we welcome you to continue to monitor our work."
(4) Detect questions that are too long and replace them with the brief question shown on the website, rather than using the detailed question text.
(5) Delete Q&A entries whose answers are too long.
(6) Delete the sender's anonymized name and the time of the consultation.
Applying these rules improves the quality and usability of the dataset.
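The filtering rules above can be sketched as a small cleaning function that also emits the {"content", "summary"} pairs used for fine-tuning. This is a minimal illustration, not the paper's pipeline: the record field names, the length thresholds, the courtesy-sentence pattern, and the choice of a "[LINK]" placeholder for link normalization are all assumptions, and the sample record is assembled from the examples quoted in the text.

```python
import json
import re
from typing import Optional

MAX_QUESTION_CHARS = 200  # assumed threshold for "too long" questions
MAX_ANSWER_CHARS = 300    # assumed threshold for "too long" answers
URL_PATTERN = re.compile(r"https?://\S+")
# Courtesy sentences that carry no answer content (pattern is an assumption).
COURTESY_PATTERN = re.compile(r"Thank you for your interest[^.]*\.")


def clean_record(record: dict) -> Optional[dict]:
    """Return a {"content", "summary"} pair, or None if the record is dropped."""
    question = record["question"].strip()
    if len(question) > MAX_QUESTION_CHARS:
        question = record["title"].strip()               # rule (4)
    answer = COURTESY_PATTERN.sub("", record["answer"])  # rule (3)
    answer = URL_PATTERN.sub("[LINK]", answer).strip()   # rule (2)
    if not answer or len(answer) > MAX_ANSWER_CHARS:     # rule (5)
        return None
    # Rules (1) and (6): department, sender, and time are not copied over.
    return {"content": question, "summary": answer}


# Hypothetical raw record; the field names are illustrative.
raw_record = {
    "title": "Flexible employment subsidy eligibility",
    "question": "How long do you need to be in flexible employment to be "
                "eligible for a flexible employment subsidy again?",
    "answer": "Complete 90 days. Thank you for your interest in the transit "
              "industry and we welcome you to continue to monitor our work.",
    "department": "(official unit)",
    "sender": "***",
    "time": "(timestamp)",
}

cleaned = clean_record(raw_record)
print(json.dumps(cleaned, ensure_ascii=False))
```

Writing one such JSON object per line gives a file in the same shape as the fine-tuning examples shown in Section III-B.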

3) STATISTICS
The dataset contains 1k Q&A pairs of common government service inquiries. The questions are divided into 7 categories, and detailed statistics of the dataset are shown in TABLE 1. Each question in the dataset belongs to one category, and clear categorization makes it easier for users to understand what a question is about.

B. EXPERIMENTAL SETUP
The experiment takes Beijing as an example and uses the collected dataset of common government service consultation questions and answers. This paper follows the evaluation process for NLG tasks and uses a set of metrics to comprehensively evaluate the quality of the responses that large language models generate for a given question; the generated Chinese responses are compared with the standard Chinese responses. Four configurations are evaluated with seven metrics: a large language model alone (LLM), a fine-tuned model (DLLM), a knowledge base combined with a fine-tuned model (KBDLLM), and GCALLM. The large language model baseline is ChatGLM [6], which generates answers from input questions without introducing external domain knowledge.
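As a rough illustration of how a generated response can be scored against a standard answer, the sketch below computes F1 variants of ROUGE-N and ROUGE-L, which are among the overlap metrics used in this evaluation. It is a simplified stand-in rather than the evaluation code used in the paper: tokenization is left to the caller (whitespace splitting is used in the toy example, while character-level splitting is one possible choice for Chinese), and the example sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def rouge_n(candidate, reference, n):
    """ROUGE-N as an F1 score over clipped n-gram overlap."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    return f1(overlap / sum(cand.values()), overlap / sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L as an F1 score over the longest common subsequence."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    return f1(lcs / len(candidate), lcs / len(reference))


# Toy example with whitespace tokens (sentences are illustrative only).
generated = "the term is three years".split()
standard = "the term is usually three years".split()

print("ROUGE-1:", round(rouge_n(generated, standard, 1), 4))
print("ROUGE-2:", round(rouge_n(generated, standard, 2), 4))
print("ROUGE-L:", round(rouge_l(generated, standard), 4))
```

Published ROUGE implementations differ in details such as stemming and the weighting between precision and recall, so scores from this sketch are not directly comparable with those reported in TABLE 2.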

1) GCALLM
GCALLM uses the P-tuning v2 [11] method to fine-tune ChatGLM [6] on the dataset collected in this paper, yielding a fine-tuned model that holds government service domain knowledge. Given a question, domain information is first obtained from the fine-tuned model; this information is then combined with the question using a specific prompt template and sent to the large language model, which answers the question.
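The two-stage flow described above can be sketched with the two models passed in as plain callables. This is a schematic under stated assumptions: the template wording follows the example in Section III-B, the function names are illustrative, and both models are replaced by stubs so that the sketch runs without any model weights. The translation module is omitted.

```python
from typing import Callable

# Template T' from Section III-B: known information D plus the question x.
KNOWLEDGE_TEMPLATE = (
    "Known information: {D}. Based on the above known information, use the "
    "known information to produce an output that cannot be modified or added "
    "to. The output should be in Chinese. The question is: {x}"
)


def gcallm_answer(
    question: str,
    domain_model: Callable[[str], str],
    general_model: Callable[[str], str],
) -> str:
    """Fine-tuned model supplies knowledge D; the general model answers X'."""
    knowledge = domain_model(question)                           # obtain D
    prompt = KNOWLEDGE_TEMPLATE.format(D=knowledge, x=question)  # build X'
    return general_model(prompt)                                 # answer


def stub_domain_model(question: str) -> str:
    # Stand-in for the P-tuning v2 fine-tuned model.
    return "Collective contracts generally have a duration of 3 years"


def stub_general_model(prompt: str) -> str:
    # Stand-in for the general model: it simply repeats the known information.
    known = prompt.split("Known information: ", 1)[1].split(". Based on", 1)[0]
    return known + "."


answer = gcallm_answer(
    "Duration of the collective contract?", stub_domain_model, stub_general_model
)
print(answer)
```

Keeping the models behind callables means the same function could wrap local inference or a remote service without changing the prompt assembly step.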

2) KBDLLM
First, sentence vectors are obtained for the government consultation information. This paper uses the text2vec-large-chinese model from the Sentence Transformers library to obtain the sentence vectors. After the sentence vectors for the knowledge base are obtained, FAISS [36], which uses inverted indexing to accelerate approximate vector search, is employed to conduct a similarity search and return the top_k answers. An appropriate chunk_size is used to supplement these answers with surrounding context. The answers are then sorted and assembled, together with the user's question, into a prompt using langchain's prompt_template function, and the prompt is passed to the fine-tuned large language model. The fine-tuned model is instructed to understand and analyze the reference answers and to generate an accurate and complete answer; the prompt also forbids the model from making up an answer when it cannot find one and asks it to give a reasonable hint instead.
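The retrieval step above can be illustrated with an exact cosine-similarity search in plain Python. This is a deliberately simplified stand-in: it performs a brute-force search instead of FAISS's accelerated approximate search, it uses hand-written three-dimensional toy vectors instead of sentence embeddings, and the prompt wording and function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k=2):
    """Return the k passages whose vectors are most similar to the query.

    corpus is a list of (text, vector) pairs.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_context_prompt(question, passages):
    """Assemble retrieved passages and the question into one prompt string."""
    context = "\n".join(passages)
    return f"Known information:\n{context}\nQuestion: {question}"


# Toy knowledge base: the vectors are hand-written, not real embeddings.
corpus = [
    ("medical insurance answer", [1.0, 0.0, 0.0]),
    ("provident fund answer", [0.0, 1.0, 0.0]),
    ("flexible employment answer", [0.7, 0.7, 0.0]),
]
query_vec = [0.9, 0.1, 0.0]

passages = top_k(query_vec, corpus, k=2)
print(build_context_prompt("Which rules apply to my case?", passages))
```

In a real system the brute-force ranking would be replaced by an index lookup, which is the part that FAISS accelerates when the knowledge base is large.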

3) DLLM
The P-tuning v2 [11] method is used to fine-tune ChatGLM [6] on the dataset collected in this paper, yielding a fine-tuned model that holds government service domain knowledge. The question is sent directly to the fine-tuned model, which generates the answer.

C. METRICS
Because the content generated by large language models is both intelligent and random, this paper adopts BERTScore [37], an evaluation metric based on a language model. BERTScore uses contextual embeddings to calculate the similarity between tokens: the generated text and the reference text are each split into tokens, features are extracted with the BERT model, and the inner product between each pair of tokens from the two texts is computed to build a similarity matrix. The maximum similarity scores are then accumulated over this matrix and normalized to obtain the Precision, Recall, and F1 values. BLEU [38], ROUGE-1, ROUGE-2, and ROUGE-L [39] are also used to measure the degree of agreement between the responses generated by the large language models and the standard answers.

D. EXPERIMENTAL ENVIRONMENT SETTINGS
Our experimental environment uses the deep learning framework PyTorch, version 2.0.1. The NVIDIA graphics card driver is CUDA 12.0, the graphics card is a single RTX 3090, and the operating system is Ubuntu 20.04.

V. RESULT
TABLE 2 gives the evaluation results on the government dataset for seven metrics: BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore-P, BERTScore-R, and BERTScore-F1. In general, fine-tuning improves the model. As TABLE 2 shows, using only the large language model (LLM), ChatGLM [6], to answer government consultation questions yields suboptimal results. The BERTScore-F1, BERTScore-P, and BERTScore-R values are 0.6145, 0.5834, and 0.6564, respectively. The BLEU score is 0.1154, indicating a low precision match between generated answers and standard answers. The ROUGE-1 value is 0.1954, suggesting low overlap at the single-word level. The ROUGE-2 value is 0.0460, indicating low overlap at the level of two consecutive words, and the ROUGE-L value is 0.1412, implying weak long-sequence matching and semantic relevance. DLLM improves over LLM on all seven metrics, enhancing precision matching, single-word overlap, consecutive two-word overlap, and semantic relevance in answering government consultation questions. KBDLLM also improves over LLM in these respects but falls short of DLLM in all of them; despite attempts to mitigate over-interpretation through prompts, its performance remains inferior to that of DLLM. GCALLM surpasses LLM, DLLM, and KBDLLM on BLEU, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore-F1, BERTScore-P, and BERTScore-R, achieving the highest values. Its generated answers therefore show the best precision matching, single-word overlap, consecutive two-word overlap, and semantic relevance with respect to the standard answers. This performance is attributed to the specific instruction templates and the incorporation of domain knowledge. GCALLM improves the rigor and accuracy of answers, making consultation services more intelligent, interactive, and user-friendly. The SeamlessM4T-NLLB module generates answers in seven different languages, supporting the development of government service consultations.

VI. CASE STUDY
TABLE 3 gives the responses of ChatGLM [6] on its own and of the three approaches built on it. Because the responses are long, only key information is shown and ellipses replace the rest. Without any knowledge prompt, the large language model misunderstands the question and fails to provide the correct answer. Injecting domain knowledge into a large language model to create a domain fine-tuned model yields promising responses without any explicit knowledge prompt, but the generated answers are less complete than the standard responses. When the knowledge base is combined with the fine-tuned model, the large amount of information supplied by the knowledge base causes the fine-tuned model to generate brief answers, which are also less complete than the standard answers. With GCALLM, the responses accurately match the standard answers, which indicates the effectiveness of GCALLM for government service consultation. The information is also presented accurately in the seven major languages, supporting the development of digital government consulting services.

VII. CONCLUSION AND FUTURE WORK
This paper explores two ways of enhancing user queries to large language models in the government service consultation scenario: a fine-tuned model combined with a large language model, and a knowledge base combined with a fine-tuned model. Because relevant datasets are lacking, this paper uses the common government-citizen interaction information on the website of the Beijing Municipal People's Government to construct a Q&A dataset, and then uses a fine-tuned model to provide domain-specific knowledge to the large language model. Seven evaluation metrics are used to verify the model, and extensive experiments show that the domain-specific language model designed for Beijing's government consulting service is effective.
However, GCALLM still suffers from hallucination. Ordering the fine-tuning data, from low quality to high quality, from short samples to long samples, and from easy tasks to difficult ones, also remains an area for improvement. Future work will explore other fine-tuning techniques and reinforcement learning to enhance the performance of GCALLM.

TABLE 1. Government data analysis.

TABLE 3. Examples of government questions and answers.