Integrating Static Quality Assurance in CI Chatbot Development Workflows

To fill a gap in proposals to integrate automated quality assurance mechanisms into the chatbot development workflow, we present a continuous integration workflow for chatbot development, implemented as GitHub Actions, and show its usefulness by applying it to open source chatbots.

Currently, developers must rely on the QA mechanisms that the selected chatbot development platform provides.1 Moreover, the abundance of technologies makes it difficult to develop QA techniques, especially static ones, that are applicable to all of them.
To alleviate this problem, we propose chatbot QA techniques executable as part of continuous integration (CI) workflows via a ready-to-use GitHub action. Our proposal is technology-independent since our QA techniques are applicable to several chatbot platforms and versions by the use of an intermediate chatbot representation.6 For instance, the same QA workflow can be executed on a chatbot implemented in Rasa 2.0, on its evolution to Rasa 3.0, or on a Dialogflow chatbot. Our workflow supports the extraction of the chatbot design, its graphical visualization, its static analysis (e.g., to detect issues like poorly trained chatbot intents or defects in the designed conversations), and its measurement (e.g., to compare design aspects, like size or complexity, against thresholds established by the development organization).
This article describes our proposal and evaluates its usefulness for the QA of open source chatbots built with heterogeneous technologies.

Task-Oriented Chatbots
Open-domain chatbots, like ChatGPT or Gemini, rely on large language models (LLMs). These are deep-learning architectures trained on vast amounts of data and able to generate text upon user prompts. Our interest is in chatbots that perform specific tasks, like booking a flight on an airline information system. While open-domain chatbots could be fine-tuned for the task, and prompts could be designed to instruct the LLM to complete the task, the technology to achieve reliable, robust, trustworthy task-oriented chatbots using LLMs is still in the making.7 Instead, task-oriented chatbots are designed around the intents, or tasks, that the chatbots must address. Prominent technologies for developing these chatbots are Google's Dialogflow,3 IBM Watson,8 Amazon Lex,4 or Rasa.5 Each intent defines training phrases that exemplify how to express the user's intention (e.g., "I'd like to book a flight to London" if the intent is booking a flight). As Figure 1 depicts, the user interacts with the chatbot through a channel, e.g., a website or a social network like Slack or Telegram. When the user produces an utterance (step 1 in the figure), the chatbot matches the most similar intent with a confidence level (step 2). If the confidence is below a threshold, then a fallback intent is selected (if one is available), which informs that the user utterance was not understood (step 3a).
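As a rough illustration of steps 2-3a, the sketch below scores an utterance against each intent's training phrases and resorts to a fallback intent when no score reaches the confidence threshold. All names, the word-overlap scoring, and the threshold value are hypothetical simplifications; real platforms match intents with trained natural language understanding models.

```python
from dataclasses import dataclass

@dataclass
class Match:
    intent: str
    confidence: float

def match_intent(utterance, intents, threshold=0.5):
    """Toy intent matcher: score each intent by the fraction of a
    training phrase's words present in the utterance (step 2), and
    fall back when no intent reaches the threshold (step 3a)."""
    words = set(utterance.lower().split())
    best = Match("fallback", 0.0)
    for name, phrases in intents.items():
        for phrase in phrases:
            phrase_words = set(phrase.lower().split())
            score = len(words & phrase_words) / len(phrase_words)
            if score > best.confidence:
                best = Match(name, score)
    if best.confidence < threshold:
        return Match("fallback", best.confidence)
    return best

intents = {
    "book_flight": ["I would like to book a flight to London"],
    "cancel_booking": ["cancel my booking please"],
}
print(match_intent("I would like to book a flight", intents).intent)
```

A real matcher also returns the extracted parameters, but the confidence-then-fallback control flow is the part this sketch tries to capture.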
Intents may also declare parameters, which are pieces of information that the chatbot needs to extract from the utterances (step 3b). For example, in the flight booking intent, the chatbot needs the origin, destination, and date of the trip. Parameters are typed by entities, which can be predefined (e.g., numbers, dates) or domain-specific (e.g., airport codes). Parameters may also be optional or mandatory. Whenever the user does not provide a mandatory parameter, the chatbot will prompt the user for it. After extracting the parameters,
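The prompting for missing mandatory parameters can be sketched as follows; the data layout, parameter names, and prompt texts are invented for illustration and do not reflect any platform's actual API.

```python
def next_prompt(intent_params, filled):
    """Return the prompt for the first mandatory parameter the user
    has not provided yet, or None once the intent can execute its
    actions. intent_params maps name -> (required?, prompt text)."""
    for name, (required, prompt) in intent_params.items():
        if required and name not in filled:
            return prompt
    return None

# Hypothetical parameter declaration for a flight-booking intent:
# origin and destination are mandatory, the date is optional.
book_flight_params = {
    "origin": (True, "Where are you flying from?"),
    "destination": (True, "Where do you want to fly to?"),
    "date": (False, "When do you want to travel?"),
}

print(next_prompt(book_flight_params, {"origin": "MAD"}))
```

With only the origin filled, the chatbot would next ask for the destination; once both mandatory parameters are present, `next_prompt` returns `None` and the intent's actions can run.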

Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

FEATURE: STATIC QUALITY ASSURANCE FOR CHATBOTS
the chatbot will execute the actions associated with the intent, like querying an external application programming interface (e.g., to retrieve the available flights) or replying to the user (steps 4b and 5).
Finally, intents can be organized into conversations, whose entry points are called flows. Each conversation flow may then bifurcate into several paths, depending on the user responses.

Chatbot QA
Since chatbots are a kind of software, analyzing their quality using dedicated tools is a fundamental task in their development cycle. Three main approaches can be used for this purpose: testing, metrics, and static design validation.
Testing is a widespread technique for checking the correctness of software systems, recently applied to chatbots.9,10,11 In industry, Botium (https://cyara.com/products/botium/) permits defining test cases (scenarios of expected user-chatbot interactions) and executing them against chatbots. Likewise, Rasa provides a continuous integration/continuous deployment (CI/CD) pipeline (https://rasa.com/docs/rasa/setting-up-ci-cd) with mechanisms for testing the chatbot's message processing and dialogue management. In particular, the train-test GitHub action (https://github.com/marketplace/actions/rasa-train-test-model-github-action) facilitates the training and testing of Rasa chatbots on GitHub. Overall, the applicability and usefulness of chatbot testing are well recognized. However, testing requires a functional chatbot (precluding its use in early development phases), requires defining test scenarios of user-chatbot interactions, and may consume considerable time and resources.
As a complement to testing, researchers have investigated ways to detect potential defects in chatbots earlier, at the design level. For example, Chatbottest12 outlines guidelines for identifying chatbot design issues in categories like answering, error management, intelligence, navigation, personality, and understanding. However, the burden of assessing whether a chatbot meets the guidelines is on the developer.
Also at the design level, we propose metrics and static analysis as inexpensive mechanisms for detecting chatbot design issues automatically. The goal is to uncover common errors related to user experience (e.g., chatbots issuing mostly messages with negative sentiment), chatbot comprehensibility (e.g., wordy or long conversations), or intent quality (e.g., insufficient training phrases) effortlessly and without requiring a functional version of the chatbot. Our chatbot design metrics and static analyzer are defined on a neutral chatbot design language, called Conga.6 This encompasses the features of 15 of the most widely used intent-based chatbot construction platforms,1 which enables mapping the design concepts of Conga from and to these platforms. This way, the metrics and analyses applied to Conga designs can be considered technology-agnostic, being applicable to many chatbot platforms.
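The idea of a neutral representation can be sketched with a toy model, loosely inspired by the role Conga plays (the real Conga metamodel is far richer and defined via model-driven engineering). Platform parsers would populate these classes, and metrics and analyses then work on them regardless of whether the source was Rasa or Dialogflow; all class and function names below are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    training_phrases: list

@dataclass
class ChatbotDesign:
    """Toy technology-neutral chatbot design; a parser per platform
    would translate Rasa or Dialogflow projects into it."""
    name: str
    intents: list = field(default_factory=list)

def num_intents(design):
    # A size metric defined once, over the neutral representation,
    # hence applicable to any platform with a parser.
    return len(design.intents)

design = ChatbotDesign("demo", [Intent("greet", ["hi", "hello"]),
                                Intent("bye", ["goodbye"])])
print(num_intents(design))  # 2
```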

A CI Workflow for Chatbot Development
To help improve the chatbot development, release, and maintenance process, we have defined the CI workflow depicted in Figure 2. It comprises a GitHub action (see https://github.com/satori-chatbots/chatbots-actions/) that triggers automated static quality assurance (SQA) checks whenever a change to a chatbot is pushed into the repository. Our action assumes that the chatbot is in the repository, so for platforms like Dialogflow, the chatbot needs to be exported from the platform and then imported into the repository.
To achieve technology independence, the action first transforms the chatbot into the Conga language. Then, it produces a graphical visualization of the chatbot design; computes metrics of the chatbot design, comparing their values against predefined thresholds; and performs static analysis of the chatbot design to detect potential problems. Finally, it reports the workflow results to the developer. Compared to the Rasa CI/CD workflow, we focus on static analyses, rather than on testing, and remain technology-independent. Next, we detail the workflow steps.
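The steps above can be sketched as a simple pipeline. The step implementations below are toy stand-ins, not the action's code: the real action runs the Conga parsers, PlantUML rendering, the metric suite, and the static analyzer.

```python
def run_workflow(chatbot_files, parse, visualize, measure, analyze, report):
    """Sketch of the workflow: parse the platform-specific chatbot
    into a neutral design, then visualize, measure, and statically
    analyze that design, and report everything to the developer."""
    design = parse(chatbot_files)
    return report(visualize(design), measure(design), analyze(design))

trace = []  # records the order in which the steps execute
result = run_workflow(
    ["domain.yml", "nlu.yml"],  # hypothetical Rasa project files
    parse=lambda files: {"intents": ["greet"]},
    visualize=lambda d: trace.append("visualize") or "diagram",
    measure=lambda d: trace.append("measure") or {"#intents": 1},
    analyze=lambda d: trace.append("analyze") or [],
    report=lambda dia, met, iss: {"diagram": dia, "metrics": met, "issues": iss},
)
print(trace)
```

Structuring the workflow around the parsed neutral design is what lets every downstream step stay oblivious to the source technology.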
Transformation into Conga: In previous work,6 we created the Conga language using model-driven engineering. Its abstract syntax is defined by a metamodel, and its concrete syntax is textual. Its architecture is modular and extensible, enabling the provision of parsers and code generators from/to different chatbot tools. This makes our CI workflow independent of the chatbot technology since adding a parser from a specific technology to Conga makes the action available for that technology. Currently, Rasa (versions 1.10 to 3.0) and Dialogflow are supported.
Design visualization: To facilitate comprehension, our action produces a graphical visualization of the chatbot design that abstracts away the accidental complexity that specific chatbot technologies may introduce. For example, Rasa chatbots are defined in multiple files and use Python programming, while Dialogflow chatbots are defined via web forms, making it difficult to understand the design underlying the implementation. Instead, the produced visualization represents the chatbot design as a state machine, where messages in arrows correspond to user utterances, and states correspond to chatbot actions.
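A minimal sketch of such a rendering, emitting PlantUML state-machine source from a list of (state, intent, state) transitions. The transition format is an assumption for illustration, not the action's actual data model.

```python
def to_plantuml(transitions):
    """Render a chatbot design as a PlantUML state machine: arrows
    are labeled with the user intents/utterances that trigger them,
    and states correspond to chatbot actions."""
    lines = ["@startuml"]
    for source, intent, target in transitions:
        lines.append(f"{source} --> {target} : {intent}")
    lines.append("@enduml")
    return "\n".join(lines)

# Hypothetical two-transition design for a flight-booking chatbot.
diagram = to_plantuml([
    ("Start", "greet", "Welcome"),
    ("Welcome", "book_flight", "AskDestination"),
])
print(diagram)
```

Feeding the resulting text to the PlantUML renderer would produce the kind of diagram the action attaches to its report.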
Design metrics: In previous work,13 we developed a suite of metrics (cf. Table 1) measuring internal quality properties of chatbot designs about their size (number of intents, supported languages, flows, paths), intent quality (number and complexity of training phrases, similarity of intents), output phrases (length, reading time, complexity, readability, sentiment), vocabulary (number of entities, complexity of entity literals), and conversations (length, paths, actions).
Metrics can help uncover potential problems, like intents with similar training phrases (so the chatbot may get confused and not recognize the intention behind a user utterance), a low number of training phrases, overly long conversations (difficult for users to complete), or lengthy chatbot responses (tedious to read, to listen to in the case of voice chatbots, or even impossible to deploy in certain channels like X/Twitter).
Our action enables measuring designs and defining thresholds to ensure adherence to internal company guidelines or quality standards, like setting a minimum number of training phrases per intent (e.g., 10 are recommended in Abdellatif et al.14) or a maximum output length (e.g., when targeting restricted channels).
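Threshold checking of this kind can be sketched as follows; the metric names, threshold values, and (min, max) encoding are hypothetical examples, not the action's configuration format.

```python
def check_thresholds(metric_values, thresholds):
    """Compare metric values against (min, max) thresholds, where
    None means unbounded on that side. Return the names of the
    metrics that violate their threshold."""
    violations = []
    for name, value in metric_values.items():
        lo, hi = thresholds.get(name, (None, None))
        if (lo is not None and value < lo) or (hi is not None and value > hi):
            violations.append(name)
    return violations

# Hypothetical measured values and org-defined thresholds: at least
# 10 training phrases per intent, output length capped at 280 chars.
values = {"min_training_phrases_per_intent": 4, "max_output_length": 180}
thresholds = {"min_training_phrases_per_intent": (10, None),
              "max_output_length": (None, 280)}
print(check_thresholds(values, thresholds))
```

Here only the training-phrase metric violates its threshold, which is exactly the kind of indicator the action surfaces in its results table.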
Design validation: Our static analysis performs checks on the chatbot design to detect issues. These checks complement the metrics, detecting well-formedness problems (e.g., duplicate identifiers, several conversation flows starting with the same intent, malformed regular expressions), unused elements (e.g., unused intents), nonterminating conversation loops, repeated training phrases, or the lack of a fallback intent, among others. We support both technology-agnostic validations that any chatbot design should fulfill, and technology-specific ones. For example, Rasa does not support multiple fallback intents or multilanguage chatbots, and Dialogflow can issue at most one HTTP request in every conversation turn. The issues are classified into errors (i.e., the design is malformed, so the chatbot will not compile or will fail at run-time) and warnings (i.e., quality problems that may make the chatbot not work as expected). We currently support 20 technology-agnostic validations (eight errors and 12 warnings), two specific validations for Dialogflow, and three for Rasa.
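Two checks in the spirit of these validations, duplicate training phrases (cf. G17) and a missing fallback intent (cf. G19), can be sketched as follows. The dictionary layout and the severities assigned are illustrative assumptions; the real analyzer implements many more rules over the Conga representation.

```python
def validate(bot):
    """Run two sample design checks and return (severity, message)
    pairs: duplicate training phrases within an intent, and the
    absence of a fallback intent."""
    issues = []
    for name, phrases in bot["intents"].items():
        if len(phrases) != len(set(phrases)):
            issues.append(("warning",
                           f"intent '{name}' has duplicate training phrases"))
    if not bot.get("fallback"):
        issues.append(("warning", "chatbot has no fallback intent"))
    return issues

# Hypothetical chatbot with a copy-pasted phrase and no fallback.
bot = {"intents": {"greet": ["hi", "hello", "hi"]}, "fallback": False}
for severity, msg in validate(bot):
    print(severity, msg)
```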
We designed our validations considering several sources. Errors and technology-specific checks stem from an analysis of existing platforms.1 The remaining ones reflect our experience developing chatbots, the analysis of open source chatbots,13 and guidelines and recommendations from the literature (e.g., few training phrases in intents).14
Results report: Figure 3 shows a screenshot of the results of the action execution, triggered upon pushing into the repository. The action is lightweight, typically completing in seconds (29 s in the figure), since it does not require executing the chatbot. This allows an inexpensive, early assessment of the chatbot's quality even when the chatbot is not fully functional. The results comprise: 1) a diagram (generated with PlantUML) of the chatbot design represented as a state machine; 2) a table with the metric values and indicators of their compliance with the thresholds established within the project; and 3) a table with the problems found. The metrics to calculate and their thresholds are configurable. In the figure, the developer filtered out some metrics and established thresholds based on her experience and knowledge about the expected chatbot usage. Empirical studies about chatbot usage could be used to identify metric outliers (too low or too high values) and fine-tune such thresholds.13 In this example, the metrics identify potential issues due to intents with few training phrases and entities with no literals. Moreover, the static analysis detects unused intents (i.e., intents that are not part of any conversation) and intents with a low number of training phrases (<3).

Evaluation
To assess the usefulness of our proposal, we analyzed the commit history of open source chatbots with the goal of answering the research question (RQ): "Could our SQA action have helped detect potential issues committed during the chatbot development process?"
The top of Figure 4 shows the analyzed chatbots. We selected them by crawling GitHub to identify relevant repositories containing Rasa or Dialogflow chatbots, as well as a history with at least six commits modifying the chatbot specification (i.e., not only changing the backend). We filtered out non-English chatbots using a language identification service and then kept the chatbots with more commits. Overall, we selected 20 chatbots: 18 built with Rasa and two with Dialogflow. The latter are much less common on GitHub, since Dialogflow chatbots must be exported from Google's platform before being pushed into GitHub, e.g., along with their backend.
Then, we applied our SQA action to each commit modifying the chatbot. Figure 4 summarizes the results (full data at https://github.com/satori-chatbots/chatbots-actions-experiments). The table shows the number of commits (modifying the chatbot and total), the different issues throughout the commit history, the remaining issues after the last commit, the average number of total commits until an issue is resolved, and the types of pending issues. The types of identified issues are G9 (intent without training phrases in one of the chatbot languages), G12 (unused entity), G13 (intent not used in any conversation), G15 (intent poorly trained, with less than three training phrases), G16 (two intents starting different conversation flows have common training phrases), G17 (intent with duplicate training phrases), G18 (improper use of text parameter in training phrase), and G19 (missing fallback intent). After the last commit, only three chatbots were free of defects.
Graph (a) in Figure 4 shows the number of issues corrected across commits and the remaining ones. Of the eight issue types detected, all except G12 are present in the last version of some chatbot and, globally, 25.5% of issues remain. Graph (b) shows the percentage of chatbots with a given issue in their last version. Overall, 85% of the chatbots have some issue in their final version. Moreover, the average number of commits before an issue is resolved ranges between 4.6 and 112.8. This suggests that the projects could have benefited from our QA action.
Not all detected issues affect the chatbot behavior. For example, Conf-Chatbot has many unused intents (G13), and many of its intents lack at least three training phrases (G15). However, since Conf-Chatbot does not use these intents in any conversation flow, their presence does not affect the chatbot behavior; it merely includes unnecessary data in the chatbot definition, akin to "dead code." Similarly, issue G17 (an intent with duplicate training phrases) does not affect behavior but gives the false impression of high-quality training. This issue is present in 30% of the chatbots. Since training phrases in Rasa are defined in text files, they are prone to copy-paste mishaps.
However, some encountered issues interfere with proper chatbot operation and become errors that should be fixed before releasing the chatbot. For instance, Cinebot features issue G16 from its second to its last version. Specifically, its intents "book_tickets" and "collect_tickets" define the same training phrase, "tickets please." Since these intents are entry points for two conversation flows, there is a conflict. Indeed, if a user says this phrase to Cinebot, its natural language understanding model favors the "book_tickets" intent (with a confidence of 0.9091, against 0.0890 for "collect_tickets"). In practice, this precludes starting the conversation to collect tickets using this phrase. Additionally, both Dialogflow chatbots have few training phrases (G15, metric TPI), and rather short ones (metric WPTP), which may hinder these chatbots from recognizing the user's intention, producing incorrect outcomes. Notably, 50% of the chatbots have issue G19 (missing fallback intent), meaning that the chatbot will not reply when it does not understand the intent of the user. Finally, 40% of the chatbots have intents without training phrases (G9), which makes them unable to recognize the user's intention in those cases. For example, the welcome intent of ISU-Jovo-v2 lacks phrases, so the chatbot does not recognize when the user starts a conversation by greeting.
Regarding performance, the execution time is on the order of seconds for the whole process (between 30 and 90 s for the analyzed chatbots), which is reasonably fast given the number of analyses performed. Hence, we can answer the RQ affirmatively: our SQA action detected these problems, which is a first step toward their resolution. In fact, to assess the resolvability of the detected problems, we successfully fixed by hand those present in the chatbot Cinebot. As a validity threat, these results are specific to the analyzed chatbots and cannot be generalized; other open source chatbots may have other problems or lack problems. To mitigate any possible bias, we tried to select quality chatbots.
Chatbots should be developed following sound software engineering principles. We claim that CI can help achieve this aim, as is the case for other types of software. CI automates the integration of code changes by multiple contributors into a common repository, after asserting the correctness of the new code. Our proposed CI workflow comprises a GitHub action covering design visualization, measurement, and static analysis for chatbots. We challenge the community to provide further chatbot quality assessments, which can be integrated as part of CI workflows. In this respect, we are currently working on a technology-independent unit testing action, using Botium as a basis.
Our SQA action is technology-independent. We used it to analyze the history of 20 open source Rasa and Dialogflow chatbots, detecting problems in 85% of them, which suggests the usefulness of these techniques. While we focused on intent-based chatbots, emerging chatbot construction paradigms based on LLMs, like LangChain (https://www.langchain.com/), will demand QA mechanisms, likely integrated into CI workflows.
Finally, we foresee the need to migrate intent-based chatbots into LLM-based ones, and to make both chatbot types interoperable. A technology-independent approach like Conga can help in this respect.

FIGURE 2. Scheme of our CI workflow for chatbots.
G9 = The chatbot supports LANGUAGE, but the INTENT does not have an input in this language.
G12 = ENTITY should be used by some parameter.
G13 = The INTENT is never used in a flow. Intents should be used in some flow.
G15 = The INTENT must contain at least three training phrases for each language.
G16 = Confusing intents: two intents are at the start of a flow and contain equal training phrases.
G17 = Repeated training phrases for the INTENT. Two training phrases cannot be equal in the same intent.
G18 = The INTENT contains a training phrase with only a text parameter. Training phrases should contain something more than a text parameter.
G19 = The chatbot should contain at least one fallback intent.
Overall, the detected problems generally persisted through numerous commits and, in the end, the last version of 85% of the chatbots has design problems. Even the three chatbots without final issues had a substantial number of them during development, which remained across many commits.

ABOUT THE AUTHORS

JESÚS SÁNCHEZ CUADRADO is an associate professor at the Languages and Systems Department of the University of Murcia, 30100 Murcia, Spain. His research interests include model-driven engineering topics, notably model transformation languages, metamodeling, and domain-specific languages, and recently the application of AI to software modeling. Sánchez Cuadrado received his Ph.D. in computer science from the University of Murcia. Contact him at http://sanchezcuadrado.es or jesusc@um.es.

PABLO C. CAÑIZARES is an assistant professor at the University Autónoma of Madrid, 28049 Madrid, Spain. His research interests include software testing, model-driven engineering, distributed systems, and chatbots. Cañizares received his Ph.D. in computer science from the Complutense University of Madrid. Contact him at pablo.cerro@uam.es.

DANIEL ÁVILA is a research project associate with the University Autónoma of Madrid (UAM), 28049 Madrid, Spain. His research interests include computational intelligence, sentiment analysis, machine learning, model-based engineering, and chatbots. Ávila received his master's in research and innovation in computational intelligence and interactive systems from UAM. Contact him at rdavilao@outlook.com.

ESTHER GUERRA is a full professor in the Computer Science Department, University Autónoma of Madrid (UAM), 28049 Madrid, Spain. Her research interests include model-driven engineering, flexible modeling, domain-specific languages, and chatbots. Guerra received her Ph.D. in computer science from UAM. Contact her at esther.guerra@uam.es.

SARA PÉREZ-SOLER is an assistant professor at the University Autónoma of Madrid (UAM), 28049 Madrid, Spain. Her research interests include model-driven engineering, conversational agents, and domain-specific languages. Pérez-Soler received her Ph.D. in computer science from UAM. Contact her at sara.perezS@uam.es.

JUAN DE LARA is a full professor in the Computer Science Department, University Autónoma of Madrid (UAM), 28049 Madrid, Spain. His research interests include automated software development, model-driven engineering, domain-specific languages, and chatbots. De Lara received his Ph.D. in computer science from UAM. Contact him at juan.delara@uam.es.
FIGURE 3. Screenshot of the execution of the SQA action.

FIGURE 4. Summary of results of the SQA action on the selected chatbots.