AI-Based Conversational Agents: A Scoping Review From Technologies to Future Directions

Artificial intelligence is changing the world, especially the interaction between machines and humans. The ability to learn, interpret, and respond in natural language has paved the way for many technologies and applications. The combination of machine learning, deep learning, and natural language processing has helped Conversational Artificial Intelligence (AI) change the face of Human-Computer Interaction (HCI). A conversational agent is a prime example of conversational AI: it imitates human conversation in natural language. This article presents a broad overview of conversational agents, covering the techniques used to implement them, such as pattern-based, machine learning, and deep learning approaches. It also discusses the range of tasks performed in conversational agents and focuses on how conversational agents can simulate human behavior by incorporating emotion, sentiment, and affect into the context. In light of recent trends and the rise of deep learning models, the authors review the deep learning techniques and the publicly available datasets used in conversational agents. Finally, the article identifies research gaps in conversational agents and offers insights into future directions.

it has transformed human life. One of the important branches of artificial intelligence is conversational AI, which makes machines capable of understanding, processing, and responding to humans in natural language. Conversational agents have remained at the center of the AI revolution in the past few years, powered by Natural Language Processing (NLP) and Machine Learning (ML) technologies.

The associate editor coordinating the review of this manuscript and approving it for publication was Utku Kose.

A conversational agent [1] is an Artificial Intelligence (AI) program designed to imitate human conversation using spoken or written natural language over the Internet. Many alternative terms are used for conversational agents. Earlier, the term dialogue system was popular, but nowadays chatbots, smart bots, intelligent agents, intelligent virtual assistants/agents, interactive agents, digital assistants, and relational agents are used interchangeably in research articles [1], [2]. Conversational agents are the practical implementation of AI technology in industries and businesses. Conversational agents can be seen being used in various applications […] to understand a user's feelings, context, and mood, generate intelligible and engaging responses in conversations, and respond with personalization based on sentiment and emotion analysis. All these characteristics of human conversation can be achieved in conversational agents to some extent with the help of advanced natural language processing and machine learning systems, which have also made implementation, flexibility across application domains, and the ability to mimic natural conversation possible to some extent. However, machine learning approaches have drawbacks, such as learning from human-labeled data, which is time-consuming and dependent on human effort.

To summarize, this paper makes a number of significant contributions as follows:
1) Illustrating the basic working architecture of recent conversational agents.
2) A literature review of the implementation methods used in different components of conversational agents.

The rest of this article is organized as follows. Section 2 provides background knowledge on the working of conversational agents. Section 3 discusses the literature review.
Section 4 presents a comparative study of the deep learning techniques used in conversational agents and a brief overview of the major datasets used in conversational agent research. The research gaps and future directions are discussed in Section 5, and the paper is concluded in Section 6.

Conversational agents appear to work simply at first glance: a user interacts with them and receives a suitable response. However, various technologies work behind the scenes to ensure a smooth interaction. The Natural Language Understanding (NLU) unit and the Natural Language Generation (NLG) unit are the major components of the conversational agent architecture [12]. Figure 2 illustrates the architecture of conversational agents.

Natural Language Understanding is the key component, where natural language processing and the understanding of user requests take place [13]. The user query or message is provided to the natural language processing unit as input. This unit's job is to prepare and clean the input text data, which involves text pre-processing steps [14]. These steps are important for interpreting grammar and breaking an input request down into words and sentences, making it easier for a conversational agent to understand. Then the cleaned input text is converted into feature […] as input and intents as the target in intent identification. So, the outcome of this task is the intent. The entity recognition task identifies discrete pieces of information in the input text and separates them into pre-determined groups, such as people and organizations. The outcome of this task is the set of recognized entities. In cognitive understanding, certain subtasks are performed, such as sentiment analysis, emotion detection, and spell checking.
Cognitive understanding helps conversational agents analyze the user's sense, tone, or mood from the input text to improve the accuracy of response generation. The outcome of NLU is a semantic representation of context information that combines intent, entities, and cognitive information into a structured input. This context information of the user message will be the current state […]
VOLUME 10, 2022
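To make the NLU flow above concrete, here is a minimal Python sketch that turns a raw message into a structured semantic representation: an intent, recognized entities, and a sentiment label standing in for cognitive information. The keyword sets, the city gazetteer, and the SemanticFrame class are hypothetical placeholders chosen for illustration; real agents learn these components from data rather than hard-coding them.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy resources; a real agent would learn these from data.
INTENT_KEYWORDS = {
    "book_flight": {"book", "flight", "fly"},
    "check_weather": {"weather", "rain", "forecast"},
}
KNOWN_CITIES = {"paris", "london", "tokyo"}
POSITIVE_WORDS = {"great", "thanks", "happy", "good"}
NEGATIVE_WORDS = {"angry", "bad", "terrible", "late"}


@dataclass
class SemanticFrame:
    """Structured NLU output: intent + entities + cognitive information."""
    intent: str
    entities: dict = field(default_factory=dict)
    sentiment: str = "neutral"


def preprocess(text: str) -> list[str]:
    """Lowercase, replace punctuation and emojis with spaces, split on whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()


def classify_intent(tokens: list[str]) -> str:
    """Pick the intent whose keyword set overlaps most with the tokens."""
    overlap = {name: len(kw & set(tokens)) for name, kw in INTENT_KEYWORDS.items()}
    best = max(overlap, key=overlap.get)
    return best if overlap[best] > 0 else "unknown"


def recognize_entities(tokens: list[str]) -> dict:
    """Gazetteer lookup standing in for a trained NER model."""
    cities = [t for t in tokens if t in KNOWN_CITIES]
    return {"city": cities} if cities else {}


def cognitive_understanding(tokens: list[str]) -> str:
    """Tiny lexicon-based sentiment score."""
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def understand(text: str) -> SemanticFrame:
    """Run the NLU steps in order and combine their outputs."""
    tokens = preprocess(text)
    return SemanticFrame(
        intent=classify_intent(tokens),
        entities=recognize_entities(tokens),
        sentiment=cognitive_understanding(tokens),
    )
```

For example, understand("Please book a flight to Paris, thanks!") yields intent book_flight, the entity city = paris, and a positive sentiment; this combined frame is what a dialogue manager would treat as the current state.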

Relatively few literature reviews have been written on the topic of conversational agents. Table 1 presents an overview of the different survey papers on conversational agents. The presented surveys lack thorough analyses of datasets, affective components, multimodality studies, and performance metrics. Different methods and techniques have been used to implement conversational agents. This study aims to examine the various approaches and methods in conversational agents that can serve as a foundation for future empirical research. We have concentrated on crucial research areas, such as technical obstacles, datasets, the methodologies proposed in each study, their performance metrics, and their application fields. Advanced deep learning models (pre-trained models) have been increasingly used in conversational agents. This section discusses the key findings on conversational agents and the related literature from different perspectives: the approaches used to implement different tasks, the current trend toward empathic and context-aware conversational agents, the deep learning approaches used in conversational agents, multimodality, datasets, and application areas.

In conversational agents, different tasks are performed to understand the user input and generate a response to it. Figure 3 shows an overview of the techniques used in the different tasks of conversational agents. Different […]

When a user enters text or a query, the first step of a conversational agent is to prepare the data in an appropriate form to be passed to the natural language understanding unit.
User input may contain emojis, short text, informal words, incomplete words, etc., that make pre-processing a […]

This component has three language comprehension tasks: intent classification, entity identification, and cognitive understanding. Intent classification comprehends the purpose of the input, and entity identification finds the distinct pieces of information in it; entities combined with the intent allow the agent to fully understand the user's input. Conversational agents must comprehend the user intent and perform the required actions. Intent classification aims to understand the why of the input [15], and entity identification the what of the input [16]. Cognitive understanding has become a significant step for conversational agents to understand both the input and the user.

[…] (Conditional Random Fields) are frequently employed in NER and found in many applications [17].

Certain subtasks, such as sentiment analysis, emotion detection, and spell checking, are performed in cognitive understanding. Cognitive understanding helps conversational agents analyze the user's sense, tone, or mood from the input text to improve the accuracy of response generation. To build emotionally aware, empathic, or context-aware conversational agents, tasks such as emotion analysis or sentiment analysis need to be added to the NLU component. Emotion [29] is an essential aspect of a context-aware conversational agent. Emotion detection aims to extract and study fine-grained emotions, such as anger, happiness, and sadness, from text.
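Sequence labellers such as CRFs usually emit one tag per token, and a separate step groups those tags into entities. The sketch below assumes the common BIO scheme (B- begins an entity, I- continues it, O is outside any entity); the review does not state which tagging scheme the cited systems use, and the function name bio_to_spans is hypothetical.

```python
def bio_to_spans(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group per-token BIO tags into (entity_type, entity_text) spans."""
    spans: list[tuple[str, str]] = []
    current_type, current_tokens = None, []

    def close_span():
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag of a different type, starts a new entity.
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            close_span()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)  # continue the open entity
        else:  # "O" closes any open entity
            close_span()
            current_type, current_tokens = None, []
    close_span()  # flush an entity that runs to the end of the sentence
    return spans
```

Treating a stray I- tag as the start of a new entity is a lenient design choice; stricter decoders would discard it instead.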

Various methods are used to analyze text-based emotions: deep learning-based, machine learning-based, rule-based, and keyword-based approaches. A keyword-based method finds the frequency of keywords in the user input and compares them with the labels in the dataset.
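A minimal sketch of the keyword-based method in Python, assuming a small hand-made emotion lexicon (hypothetical, standing in for the labeled dataset): count how often each emotion's keywords occur in the message and return the most frequent label, or "neutral" when nothing matches.

```python
import re
from collections import Counter

# Hypothetical emotion lexicon standing in for a labeled keyword dataset.
EMOTION_KEYWORDS = {
    "happiness": {"happy", "glad", "great", "love", "thanks"},
    "anger": {"angry", "furious", "annoyed", "hate", "terrible"},
    "sadness": {"sad", "unhappy", "disappointed", "miss", "lonely"},
}


def detect_emotion(text: str) -> str:
    """Return the emotion whose keywords occur most often, or 'neutral'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter({emotion: sum(t in keywords for t in tokens)
                      for emotion, keywords in EMOTION_KEYWORDS.items()})
    label, freq = counts.most_common(1)[0]
    return label if freq > 0 else "neutral"
```

Ties go to the emotion listed first, and plain counting ignores negation ("not happy" still scores as happiness), which is one reason rule-based and learned methods were developed.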
[30] used a keyword-based emotion recognition approach to find the context from text. In rule-based methods, grammatical and logical rules are defined to detect emotions in the user text. Reference [31] defined a rule-based emotion detection system to detect implicit emotions from text data. Machine learning-based approaches enable algorithms to learn and improve automatically through experience; machine learning methods categorize the user text into pre-determined emotion categories. Reference [32] proposed a machine learning-based system to classify conversational emotions. In artificial intelligence, deep learning is a subset […]

Before defining the term context-aware, let us briefly consider what context is. Context is the cause of an event: the situation within which something exists or happens, and which can help explain it. It is the set of circumstances forming the background of an event or a statement. So, context-awareness is the ability of a system to perceive the user's environment or situation in order to reason appropriately. Context-awareness allows a system to perceive at the same level as a human and helps it work out in which sense the user is asking a question, so that it can respond to those sentiments and behaviors. Conversational agents have no knowledge of their own, so they cannot use context as humans do. It is therefore necessary to feed them the right contextual information so that they can use context on their own. Different kinds of information can be provided to conversational agents to help them understand the context.

[…] language processing techniques. Initially, keyword-based or pattern-based conversational agents were implemented. These were easy to design and implement but had limitations in responding to complex queries.
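The pattern-based agents described above can be sketched as an ordered list of regular-expression rules with a fallback reply. The rules and replies are hypothetical, chosen only to show the mechanism.

```python
import re

# Hypothetical rules, checked in order; the first matching pattern wins.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b"), "Hello! How can I help you?"),
    (re.compile(r"\bopening hours?\b"), "We are open 9am to 5pm, Monday to Friday."),
    (re.compile(r"\b(refund|return)\b"), "You can request a refund within 30 days of purchase."),
]
FALLBACK = "Sorry, I did not understand that."


def respond(message: str) -> str:
    """Return the reply of the first rule whose pattern occurs in the message."""
    text = message.lower()
    for pattern, reply in RULES:
        if pattern.search(text):
            return reply
    return FALLBACK  # anything outside the patterns
```

The design is easy to write and audit, but it is brittle: respond("I do not want a refund, I want a replacement") still returns the refund reply, because the rule only looks for a keyword and ignores the negation around it.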
Such pattern-based agents answer according to patterns or rules, but if a query falls outside the patterns, they give erroneous answers unrelated to the query. Machine learning-based conversational agents are trained on existing annotated datasets of conversations. Usually, these retrieval-based machine learning models retrieve information from a database based on the user query. The main drawback of this approach is that building a large knowledge base is time-consuming, costly, and dependent on human effort, and the resulting agent is again domain dependent.

[…] Table 3 shows that most current conversational agents are built on transformer-based architectures. These models are trained to maximize the likelihood of a response and can learn from large amounts of data to provide an acceptable response. The basic design of these models, popularly known as sequence-to-sequence models, consists of an encoder that processes the input and a decoder that generates the response. In the original sequence-to-sequence formulation both the encoder and the decoder are recurrent neural networks (RNNs), whereas the transformer replaces recurrence with self-attention. The most prominent RNN variants used to learn from conversational datasets in these models are long short-term memory (LSTM) […]

Due to commercial applications such as Amazon's Alexa, Apple's Siri, Microsoft's Cortana, and Google Assistant, conversational systems have recently seen a considerable increase in demand. As more and more businesses push for this technology, conversational agents are rapidly becoming commonplace. Humans communicate with one another through a variety of senses, or modalities. These modalities work in concert to clarify concepts and emphasize ideas in dialogue by resolving ambiguity.
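Returning to the retrieval-based agents mentioned above, the sketch below ranks a small question-answer knowledge base by TF-IDF cosine similarity to the user query and returns the stored answer of the closest match, or None below a similarity threshold. TF-IDF is one common choice of representation, swapped in here for illustration; the knowledge base, the 0.2 threshold, and the class name RetrievalBot are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base of (question, answer) pairs.
KNOWLEDGE_BASE = [
    ("how do i reset my password", "Use the 'Forgot password' link on the login page."),
    ("what are your opening hours", "We are open 9am to 5pm, Monday to Friday."),
    ("how can i track my order", "Enter your order number on the tracking page."),
]


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class RetrievalBot:
    """Return the stored answer whose question is most similar to the query."""

    def __init__(self, pairs):
        self.answers = [answer for _, answer in pairs]
        docs = [tokenize(question) for question, _ in pairs]
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        # Smoothed inverse document frequency: rarer terms weigh more.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, tokens):
        tf = Counter(t for t in tokens if t in self.idf)  # unseen words are ignored
        return {t: count * self.idf[t] for t, count in tf.items()}

    @staticmethod
    def _cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def reply(self, query, threshold=0.2):
        q = self._vectorize(tokenize(query))
        scores = [self._cosine(q, v) for v in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.answers[best] if scores[best] >= threshold else None
```

Note how the quality of such an agent is bounded by the knowledge base: a query on any topic not covered by the stored questions simply returns None, which mirrors the domain-dependency drawback described in the text.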
Nowadays, emoticons (objects encoded by standard sequences of characters) or emojis (e.g., smileys, hearts) are self-reported labels, i.e., visual information provided by users to convey emotions in their textual interactions and to underline the context of the communication, aiming for better interpretability, especially for short polysemous phrases. In conversational agents, this visual information conveys affective states and is thus a suitable indicator of sentiment and emotion in texts. Emojis and emoticons, together with the text, give a more faithful representation of the user's emotional state. In sarcastic sentences, a user may express a positive emotion in the text but a negative one through emoticons or emojis; in such scenarios, visual information can help identify the user's true emotions. This visual information in the form of emojis and emoticons therefore helps natural language understanding, e.g., "I am happy with the service!!" […]

So, untapped potential exists in the study of multimodal conversational agents, which let users and conversational agents converse using both human language and visual information, making them more realistic, human-like, and engaging. Sunder and Heck [10] defined and mathematically formulated the goal of multimodal conversational research. They identified four basic problems in multimodal conversational systems: disambiguation, response generation, coreference resolution, and dialogue state tracking. The authors suggested a taxonomy of the types of research necessary to achieve the goal of multimodal conversational agents: multimodal representation, multimodal fusion, multimodal alignment, multimodal translation (cross-modality), and co-learning. It is therefore necessary to consider this taxonomy when designing multimodal conversational agents.
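One simple way to combine the text and emoji signals is late fusion: score each modality separately, then merge the scores. The Python sketch below assumes tiny hypothetical polarity lexicons and an emoji weight of 2.0, so that the visual signal dominates when the two disagree, as in the sarcasm case above; it also flags messages whose text and emoji polarities conflict.

```python
import re

# Hypothetical polarity lexicons (+1 positive, -1 negative).
WORD_POLARITY = {"happy": 1, "great": 1, "love": 1, "good": 1,
                 "bad": -1, "awful": -1, "hate": -1, "slow": -1}
EMOJI_POLARITY = {
    "\U0001F600": 1,   # grinning face
    "\U0001F60A": 1,   # smiling face with smiling eyes
    ":)": 1,
    "\U0001F620": -1,  # angry face
    "\U0001F612": -1,  # unamused face
    "\U0001F644": -1,  # face with rolling eyes
    ":(": -1,
}


def text_score(message: str) -> int:
    words = re.findall(r"[a-z']+", message.lower())
    return sum(WORD_POLARITY.get(w, 0) for w in words)


def emoji_score(message: str) -> int:
    return sum(message.count(e) * p for e, p in EMOJI_POLARITY.items())


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def fuse(message: str, emoji_weight: float = 2.0) -> tuple[str, bool]:
    """Late fusion: weighted sum of per-modality scores.

    Returns the fused label and a flag that is True when the text and
    emoji polarities disagree (a possible sign of sarcasm)."""
    t, e = text_score(message), emoji_score(message)
    label = {1: "positive", 0: "neutral", -1: "negative"}[sign(t + emoji_weight * e)]
    conflict = sign(t) * sign(e) < 0
    return label, conflict
```

With a rolling-eyes emoji appended, "I am happy with the service!!" is fused to a negative label and flagged as conflicting, whereas the same sentence with a smiling emoji stays positive.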

This section discusses the primary datasets used by researchers in the field of conversational agents. Researchers have either curated their own datasets or used publicly available ones, according to the needs of their studies in particular application areas. This section presents some publicly available and useful datasets, specifically for improving context-awareness in conversational agents. Table 4 gives an overview of specific datasets, with the sources from which they were curated, whether the dataset is labeled […]

Conversational agents can be used in various application domains with different goals or objectives. They can be utilized for decision-making, opinion, conflict resolution, and multi-party interaction. Conversational agents play diverse roles in various fields as information providers, recommenders, tutors, entertainers, advisors, personal assistants, customer service assistants, and conversational partners. Review articles that have discussed the role of conversational agents in various application fields such as business [79], customer services [80] […]

Table 6 highlights the application area, objectives, method used, data utilized, and challenges in the selected papers.

From an applications perspective, there is still a disconnect between industrial technologies and current breakthroughs in the sector. The technologies used in research are not suited to industrial use because they demand a lot of computational resources and extremely large training datasets. In addition, conversational agents used in different businesses have distinct requirements. Protecting users' personal information is also an important issue in conversational agents. A review table of conversational agents in different application areas is presented in Table 6.

With the help of deep learning models, conversational agents have developed significantly in recent years. Several novel ideas, such as pretrained embeddings, different attention mechanisms, transformer-based models, pretrained deep learning models, and seq2seq models, have been developed, resulting in rapid advancement in the last few years. Despite these advancements, there are still issues to be resolved in the field of conversational agents. The major limitation is making conversations with humans natural through empathy, sentiment, and emotion. This section highlights some of these issues as well as research directions that could aid the field's advancement.

A. LIMITATIONS IN UNDERSTANDING THE CONTEXT [87]
This is the biggest challenge for conversational agents. […]

Datasets play an important role, as training data is required to understand intent and context and to respond naturally to the user. There are many dataset-related challenges, such as the small size of available datasets, the scarcity of labeled data, unbalanced data distributions, the limited variety of labels, and the lack of representative publicly available datasets. All these challenges are important because machine learning and deep learning techniques require a huge amount of training data.

[…] This section highlights some of the research directions that could aid the field's advancement. Techniques and methodologies in conversational agents have progressed enormously in the last few years, from rule-based methods to deep learning methods with hidden layers and pretrained models such as […]. Advancements in artificial intelligence in recent years have bolstered trends in conversational agents, enabling them to understand and reply in natural language. The authors have discussed research gaps in Section 5. Table 7 shows the mapping between research gaps and future directions. To make conversational agents contextual, a lot of data and a vast knowledge base are needed for training, so training on large datasets is one way to make them contextual. Self-training and reinforcement learning techniques can also be applied for this purpose. Some of the difficulties mentioned in the research gaps section have been addressed by AI-based technologies such as transfer learning, reinforcement learning, multi-task learning, meta-learning, self-learning, and GANs. This section discusses these challenges and their solutions using these methods, with references.

Transfer learning has contributed greatly to the progress of modern NLP systems such as conversational agents. In particular, conversational agents can benefit from inductive transfer learning, where unlabeled data is used to extract knowledge for labeled downstream tasks. [95] discussed a framework for transferring affective knowledge. In this system, the authors pre-trained a hierarchical dialogue model on multi-turn conversations (source) and then transferred its parameters to a conversational emotion classifier (target).
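The parameter-transfer step can be sketched without any deep learning library by treating each model as a dictionary of named weight vectors. Everything here is hypothetical: the parameter names and sizes, and the option to freeze the transferred encoder during fine-tuning, are illustrative assumptions rather than details of the framework in [95].

```python
import copy
import random

# Hypothetical parameter layouts: name -> number of weights.
SOURCE_SHAPES = {"encoder.utterance": 16, "encoder.context": 16, "decoder.output": 32}
TARGET_SHAPES = {"encoder.utterance": 16, "encoder.context": 16, "classifier.head": 6}


def init_params(shapes, seed):
    """Randomly initialise a toy model as a dict of flat weight lists."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-0.1, 0.1) for _ in range(size)] for name, size in shapes.items()}


def transfer(source, target_shapes, prefix="encoder.", seed=1):
    """Initialise the target model, then copy every compatible shared parameter from the source."""
    target = init_params(target_shapes, seed)
    for name in target:
        if name.startswith(prefix) and name in source and len(source[name]) == len(target[name]):
            target[name] = copy.deepcopy(source[name])
    return target


def sgd_step(params, grads, lr=0.01, frozen_prefix=None):
    """Apply one gradient-descent update, skipping parameters under frozen_prefix."""
    for name, grad in grads.items():
        if frozen_prefix and name.startswith(frozen_prefix):
            continue  # keep the transferred knowledge fixed
        params[name] = [w - lr * g for w, g in zip(params[name], grad)]
    return params
```

In use, the source would first be pre-trained on conversations (here it is only randomly initialised), then target = transfer(source, TARGET_SHAPES) gives an emotion classifier whose encoder starts from the source weights and whose new head is trained from scratch. Passing frozen_prefix=None instead fine-tunes the whole model.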

In reinforcement learning, conversational agents are trained through trial-and-error conversations with either real users or a rule-based user simulator. [96] proposed deep reinforcement learning for dialogue generation; this work marked a first step towards learning a conversational neural model based on the long-term success of dialogues. [97] developed a reinforcement learning-based conversation content generation model with emotional editing constraints.
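A toy version of trial-and-error training against a rule-based user simulator: tabular Q-learning over a two-state dialogue in which the agent must ask for a missing detail before answering. The simulator, states, actions, rewards, and hyperparameters are all hypothetical, and this setup is far simpler than the deep reinforcement learning models in [96], [97].

```python
import random

STATES = ("start", "slot_known")
ACTIONS = ("chitchat", "ask_slot", "answer")


def simulate_user(state, action):
    """Hypothetical rule-based user simulator.

    Returns (next_state, reward, done). Each extra turn costs -1, answering
    before the missing detail is known fails (-10), and answering once it is
    known succeeds (+20)."""
    if state == "start":
        if action == "ask_slot":
            return "slot_known", -1.0, False  # user supplies the missing detail
        if action == "answer":
            return "start", -10.0, True       # premature answer ends the dialogue badly
        return "start", -1.0, False           # chitchat wastes a turn
    if action == "answer":
        return "slot_known", 20.0, True       # task completed
    return "slot_known", -1.0, False


def greedy(q, state):
    """Action with the highest estimated value in this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.2, max_turns=10, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        state = "start"
        for _ in range(max_turns):
            action = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(q, state)
            next_state, reward, done = simulate_user(state, action)
            # Bootstrap from the best next action unless the dialogue has ended.
            target = reward if done else reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])
            if done:
                break
            state = next_state
    return q
```

After q = train(), the greedy policy asks for the missing detail in the start state and answers once it is known, a behavior learned purely from the simulator's rewards rather than from labeled dialogues.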

In multi-task learning, all or a subset of the tasks are related but not identical. Multi-task learning aims to improve the learning of a model for one task by using the knowledge contained in all the related tasks. Two basic factors are considered in multi-task learning [98]. The first is relatedness, that is, how different tasks […]

Many tasks in natural language processing (NLP), such as question answering, require a substantial amount of training data to improve model performance. However, gathering and annotating more data to build a large training set can be expensive and time-consuming. Data augmentation strategies can be used in this situation. One data augmentation technique is the Generative Adversarial Network (GAN). GANs are computational structures that pit two neural networks against each other to generate new, synthetic data samples that can pass for real data.

The main objectives are developing methods to scrutinize conversational agents' privacy protection and strategies to increase the agents' resistance to malicious attacks and/or data theft. Adversarial Machine Learning (AML), a new area of research that combines the latest machine learning techniques, information systems security, and robust statistics, can help solve such security-related problems. [113] explored attack/defense tactics for adversarial recommender systems using generative adversarial networks. [114]
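For reference, the objectives behind multi-task learning and GAN-based augmentation discussed above are often written as follows (standard textbook forms, not formulas taken from the cited works). In hard-parameter-sharing multi-task learning, shared parameters θ_s and task-specific parameters θ_t are trained on a weighted sum of the T task losses, where the weights λ_t are hyperparameters. A GAN trains a generator G and a discriminator D with a minimax objective, where p_data is the real-data distribution and p_z is the noise prior from which G draws its inputs.

```latex
\[
\mathcal{L}_{\mathrm{MTL}}(\theta_s, \theta_1, \dots, \theta_T)
  = \sum_{t=1}^{T} \lambda_t \, \mathcal{L}_t(\theta_s, \theta_t)
\]
\[
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
\]
```

In the multi-task case, every task's gradient flows into θ_s, which is how knowledge from related tasks is shared; in the GAN case, D is pushed to separate real from generated samples while G is pushed to make D(G(z)) close to 1.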