Conversational Affective Social Robots for Ageing and Dementia Support

Socially assistive robots (SAR) hold significant potential to assist older adults and people with dementia in human engagement and clinical contexts by supporting mental health and independence at home. While SAR research has recently experienced prolific growth, long-term trust, clinical translation, and patient benefit remain immature. Affective human–robot interactions are unresolved and the deployment of robots with conversational abilities is fundamental for robustness and human–robot engagement. In this article, we review the state of the art within the past two decades, design trends, and current applications of conversational affective SAR for ageing and dementia support. A horizon scanning of AI voice technology for healthcare, including ubiquitous smart speakers, is further introduced to address current gaps inhibiting home use. We discuss the role of user-centered approaches in the design of voice systems, including the capacity to handle communication breakdowns for effective use by target populations. We summarize the state of development in interactions using speech and natural language processing, which forms a baseline for longitudinal health monitoring and cognitive assessment. Drawing from this foundation, we identify open challenges and propose future directions to advance conversational affective social robots for: 1) user engagement; 2) deployment in real-world settings; and 3) clinical translation.


I. INTRODUCTION
WITH an ageing population set to double by 2050 worldwide, and the number of people living with dementia expected to reach 152 million by then, tripling today's figures [1], the global socioeconomic burden and strain on healthcare systems are only expected to become more critical with time. Prevalence of dementia is further skyrocketing in low-middle income countries (LMICs), where 63% of people with dementia (PwD) already live [2]. Because dementia is a chronic neurodegenerative condition, demands for dementia care increase over time. Recent studies estimate that a staggering 1 in 4 U.K. hospital beds is occupied due to a dementia-related condition [3]. Global care costs of dementia are projected to exceed U.S. $2 trillion/annum by 2030, demanding 40 million new care workers, which could easily overwhelm medical and social care systems as they stand today [3]. This global health crisis has been exacerbated by the COVID-19 pandemic, with vulnerable populations facing further limits in care and family support, isolation, and pronounced mental health decline [4], [5]; COVID-19 has caused unprecedented stress, fear, and agitation among seniors, especially those with cognitive impairment or dementia [6]. While the impact of these additional challenges will have long-lasting consequences well beyond the pandemic, they have also triggered the increased use of technological tools among older populations and opened new avenues for mental health telemedicine, including robotic solutions for human-robot cognitive engagement [7], [8].
The 2020 World Alzheimer Report estimated that the majority of PwD live at home (60% in the U.K. and 80% in the U.S.) and wish to remain there [9]. Previous research has argued that decreased communication time can lead to the acceleration of dementia [10]; stimulating communication is therefore a priority for ageing and dementia support. Thus, there is an immediate and urgent call for accessible, deployable solutions to provide: 1) personalized assistance; 2) mental health and dementia support; 3) health monitoring; 4) companionship; and 5) cognitive stimulation, thereby prolonging independent and healthy ageing at home.
Socially assistive robots (SAR) have well-documented promise to support ageing and dementia (see [11]-[17] for recent reviews). A range of intelligent social robotic platforms from mechanically complex [18], [19], mobile [20], [21], embodied humanoid SAR [22], [23], to pet-like robots [24], [25], simpler virtual assistants [26], and commercial smart home speakers (e.g., Amazon Echo and Google Home) has been introduced with strong potential to address the aforementioned needs. The literature has argued that social robots can help improve PwD's social engagement [27], attention [19], and neuropsychiatric symptoms through cognitive stimulation [28]; reduce agitation, stress [27], and depressive symptoms [23]; increase cortical neuron activity [29]; and establish rapport with end users [30]. Yet, long-term user compliance, deployment in the wild, and clinical translation remain largely untapped; the development of appropriate robots for dementia is at an early stage with few examples beyond research studies [11]-[13].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Unlike intelligent assistants or ubiquitous smart speakers, affective social robots encompass a broad spectrum of verbal and nonverbal interactive modalities (i.e., multiple channels of human-robot input/output), such as facial expressions, speech, gestures, or behavior. Speech is largely considered the most powerful communication mode for social robots to engage with end users [31]. Therefore, SAR with conversational interfaces hold promise to provide home-based cognitive support to older people with and without dementia [32].
Endowing robots with spoken language capabilities is paramount for engaging human-robot interactions (HRI). This includes the ability to recognize and process natural language [i.e., speech recognition and natural language understanding (NLU), respectively], respond accordingly (i.e., speech synthesis), and manage the conversation flow (i.e., dialog management). The investigation of robots that use natural language, as well as of emerging AI approaches for spoken dialog systems, has experienced prolific growth in recent years [33]-[36]. Furthermore, significant efforts are being made in the development of AI algorithms to automatically predict cognitive decline and detect early signs of dementia using speech and language patterns through machine learning (ML) techniques [37]-[41], providing a strong foundation for home-based longitudinal cognitive monitoring, assessment of age-related cognitive decline, and dementia progression.
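As an illustration only, the four capabilities named above (speech recognition, NLU, dialog management, and speech synthesis) can be chained as in the following minimal Python sketch. Every function, intent label, and canned response here is a hypothetical stand-in introduced for this example; real systems would replace each stub with a trained model or a cloud service, and none of the reviewed robots is claimed to work this way.

```python
from dataclasses import dataclass, field


@dataclass
class DialogState:
    """Keeps the sequence of intents seen so far in one conversation."""
    history: list = field(default_factory=list)


def recognize_speech(audio: str) -> str:
    # Stand-in for automatic speech recognition: the "audio" is already text.
    return audio.strip().lower()


def understand(text: str) -> str:
    # Toy NLU: map keywords to a hypothetical intent label.
    if "medication" in text or "pill" in text:
        return "medication_reminder"
    if "hello" in text or "hi" in text.split():
        return "greeting"
    return "unknown"


def manage_dialog(intent: str, state: DialogState) -> str:
    # Dialog management: record the intent, then choose a response.
    state.history.append(intent)
    responses = {
        "greeting": "Hello! How are you feeling today?",
        "medication_reminder": "It is time to take your medication.",
    }
    # Fall back to a clarification request when the intent is not recognized.
    return responses.get(intent, "Sorry, could you say that again?")


def synthesize(text: str) -> str:
    # Stand-in for speech synthesis: return what a TTS engine would speak.
    return f"[robot says] {text}"


def respond(audio: str, state: DialogState) -> str:
    """Run one user turn through the whole pipeline."""
    text = recognize_speech(audio)
    intent = understand(text)
    reply = manage_dialog(intent, state)
    return synthesize(reply)
```

Calling `respond("Hello there", DialogState())` walks one utterance through all four stages; the fallback branch in `manage_dialog` hints at where conversation-repair strategies would attach.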
Despite the success of SAR and spoken dialog systems in their respective research fields, integration of the two is still rare, especially when applied to older adults and PwD. While other surveys have assessed assistive robotic technology and automation for ageing and dementia support [11], [12], [17], [33], [42], including a 2020 survey on home-based healthcare robots to provide cognitive support to seniors [43], we are not aware of a review specifically targeting the depth of the field in relation to conversational capacity, multimodal affective communication, and user engagement. We believe recent proliferation as well as long-standing acceptance of need make a focused review timely and necessary. This review aims to fill said gap by surveying conversational affective social robots specifically targeted at supporting ageing and dementia, in both social and clinical scenarios. These systems are widely acknowledged as having significant future research and clinical potential to support independence in the home, telemedicine, and isolation during and beyond COVID-19 [7], [44], [45].
The remainder of this article is organized as follows. A categorization of voice interfaces, the main assistive roles of SAR for the target population, and design trends are presented in Section II. Section III comprehensively reviews the state of the art and current applications of key conversational robotic systems. A horizon scanning of AI voice-based technology for healthcare and its potential to assess cognitive decline is introduced in Section IV. Current open research challenges and potential future directions are discussed in Section V. Finally, conclusions are drawn in Section VI.
Fig. 1. Categorization of voice interfaces applied to ageing and dementia support. These can be either embodied social robots with integrated spoken language capabilities, e.g., (a) humanoid robot NAO, (b) robotic platform Kompai, and (c) social robot Jibo, or AI voice assistants and smart speakers, e.g., (d) Apple Siri, (e) Google Home, and (f) screen-based Amazon Echo. These AI systems make use of five main components to understand user intent and generate a relevant response.

II. CONVERSATIONAL SOCIAL ROBOTS FOR TARGET USERS
The benefits of bringing natural language capabilities into the robotics field have been extensively explored [33]. Voice interaction for older adults and PwD can be categorized in two main areas: 1) conversational social robots, which we consider as having a physical presence and a combination of affective communication modalities (e.g., facial expressions, gestures, or body movements), integrated with natural language interfaces and 2) virtual AI assistants and smart speaker technology, which rely essentially on voice. Commercially available instances of the latter include devices such as Amazon Echo and Google Home, with constantly evolving AI capabilities to go beyond information retrieval and provide personalized, engaging interactions. Fig. 1 shows the categorization of voice interfaces for engaging HRI, outlining the main functional components of dialog systems. While a detailed analysis of these is beyond the scope of this review, natural language conversational systems integrate: 1) automatic speech recognition, which converts input speech into textual vector representations; 2) NLU of user intents; 3) dialog management to interpret information and keep track of the [...]
Fig. 2. (a) GrowMu [46]. (b) Pepper from SoftBank Robotics [22]. (c) Hobbit [20]. (d) RAMCIP [47]. (e) CompanionAble [21]. (f) Kompai [48]. (g) Silbot [49].

A. Socially Assistive Robotics for Mental Health
Assistive technologies and cognitive robotics in the area of ageing and dementia care have seen rapid growth in the last two decades, with various SAR prototypes specifically designed for older adults. Current technologies used in dementia care have focused on assisting users in their early stages to remain independent, improve social engagement and safety in the home, and monitor health and wellbeing [60]. The spectrum of functionalities and assistance ranges from managing medication or appointments [53], intelligent reminders [61], information provision [62], and cognitive stimulation and engagement [19], to video calling with relatives [21], help in carrying out activities of daily living [63], improving mood and wellbeing [59], or autonomously detecting dangerous situations (e.g., fall detection and prevention) [20]. Overall, SAR can provide a tractable way to continuously monitor the home and enhance mental health and wellbeing, in addition to reducing the caregiver burden. Fig. 2 shows relevant examples clustered by mobile robots (i.e., capable of autonomous or teleoperated navigation in users' homes or nursing facilities) and smaller, affective robots (i.e., often simpler in their mechatronic design, stationary and portable robots that mainly interact through conversation and affective markers). Four main roles of SAR can be identified that specifically target mental health and cognitive support of target populations; these roles will be used throughout this review.
1) Companionship: The ability to engage or entertain users through affective interactions, including meaningful conversations, recommendation of activities, information retrieval, and telecommunication, ultimately reducing feelings of social isolation.
2) Health Monitoring: The ability to check the user's overall health and wellbeing, give reminders (e.g., medication), detect health patterns over time, and inform a caregiver or clinician in emergency situations.
3) Cognitive Stimulation: The ability to provide cognitive training (e.g., cognitive games); evaluation of performance over time; ability to provide early detection and warning of cognitive decline; or to slow down the progression of dementia.
4) Clinical Therapy: Robotic solutions to conduct or assist in cognitive interventions and therapy, either operating autonomously or through teleoperation; these include telemedicine platforms.

B. Review Protocol
A search for studies published between 2000 and 2021 was conducted using scientific publication databases, such as IEEE Xplore, ACM, PubMed, and Google Scholar, and search keywords related to: "social robot," "conversational robot," "elderly," "dementia," "verbal communication," and "dialog system." Our inclusion criteria consist of robotic systems that have: 1) been tested or evaluated with the target population (i.e., older adults, PwD) and 2) used an embodied social robot with: a) a face; b) speech interfaces (Fig. 1); and c) affective markers (e.g., facial expressions, gestures), therefore being able to engage with users in multimodal affective interactions. Papers were excluded if: 1) results from testing or qualitative evaluation with end users were not included; 2) they presented work on SAR without speech-based interaction; 3) the robot was unable to speak (e.g., Paro, the robotic seal [64]), even if it was capable of speech recognition (e.g., AIBO, the robotic dog [65]); or 4) the robot was covered and expanded on in a more recent publication. We here select the most recent exemplary study of each robotic platform. In the case of different studies published in the same year using the same robot, journal articles with a higher number of participants and a longer HRI duration were prioritized. A set of 30 key studies using different robotic platforms evaluated or tested with the target population was selected and comprehensively reviewed (see Table I).
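The screening logic above can be expressed compactly in code. The following Python sketch is a simplified illustration, not the tooling used for this review: the `Study` fields are hypothetical, the records in the usage note are invented placeholders rather than reviewed papers, and the same-year tie-break is reduced to participant count alone (the protocol also weighs publication type and HRI duration).

```python
from dataclasses import dataclass


@dataclass
class Study:
    robot: str                      # robotic platform used in the study
    year: int                       # publication year
    tested_with_target_users: bool  # evaluated with older adults or PwD
    has_face: bool
    can_speak: bool                 # robot produces speech, not only recognizes it
    has_affective_markers: bool     # e.g., facial expressions or gestures
    participants: int


def is_included(study: Study) -> bool:
    """Apply the inclusion criteria; failing any one of them excludes the study."""
    return (
        2000 <= study.year <= 2021
        and study.tested_with_target_users
        and study.has_face
        and study.can_speak
        and study.has_affective_markers
    )


def select_key_studies(studies: list) -> list:
    """Keep one study per robot: the most recent, ties broken by sample size."""
    best = {}
    for study in filter(is_included, studies):
        current = best.get(study.robot)
        key = (study.year, study.participants)
        if current is None or key > (current.year, current.participants):
            best[study.robot] = study
    return list(best.values())
```

For example, given placeholder records `Study("RobotA", 2015, True, True, True, True, 10)`, `Study("RobotA", 2019, True, True, True, True, 8)`, and `Study("RobotB", 2018, True, True, False, True, 20)`, only the 2019 RobotA study is retained: the older RobotA study is superseded, and RobotB is excluded because it cannot speak.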

C. Design Trends of Social Robots
To prove effective and useful for ageing and dementia care, SAR must be designed with end users, beyond evaluation or pilot testing, to ensure that their needs, concerns, and preferences are met. Therefore, a user-centered design process should be followed to achieve the functional, clinical, and sociable features desired by older adults, caregivers, and clinicians [66], [67]. This section presents overall design trends identified from the 30 key robotic platforms reviewed and discusses characteristics of human-robot interventions.
1) User-Centered Design: There has been an overall increase in research using SAR with voice interfaces to support the target population over the last two decades (Fig. 3). Additionally, there has been a growing interest in employing user-centered approaches for the design process of social robots in order to meet the needs and preferences of end users and achieve longer term adoption and compliance; whilst of the 16 robots reviewed between 2000 and 2018, only nine followed a user-centered design approach, this figure improved to six (out of 14) robots designed specifically to support ageing and dementia just in the last two years. Indeed, the development of useful SAR for target users requires their engagement in all levels of design and implementation, not just for subsequent pilot trials [60].
Older adults and PwD are increasingly willing to take assistive robots into their homes. A 2017 survey [68] of older people with cognitive impairments revealed that more than 80% of respondents were willing to entrust themselves to the care of a robot, and more than 75% of caregivers would agree to leave a patient alone with a robot. Interestingly, studies in Japan using a SAR have suggested that older citizens were often more comfortable with assistance from robots than from human caregivers [69]. Despite the increased interest in and acceptance of social robots among older adults with and without dementia, several ethical concerns often lead to reticence in adopting robotic technologies. These include reduced human contact, loss of privacy, emotional deception (which occurs when users' expectations of the robot are not met), and attachment to the robot, which may cause emotional distress [70]. Thus, it is imperative that efforts be made to design ethically safe conversational SAR, ensuring trust, transparency, and patient safety [71].
2) Robotic Platform Design: Recently, there has been a trend toward designing smaller, mechanically less complex, often stationary, potentially more affordable robotic platforms to support ageing and dementia care; whilst around 25% of the studies reviewed between 2000 and 2018 used a simpler and smaller robot, this figure increased to over 71% in research from 2019 and 2020 [Fig. 4(a)]. This tradeoff in design facilitates the deployment of such SAR systems worldwide, including in LMICs, yet requires optimization of affective robotic modalities, such as facial expressions, gestures, and, importantly, effective speech interfaces. Smaller affective robots are easier to deploy in real-world settings (i.e., homes) and are often associated with lower costs. Hence, this trend may indicate that the application of conversational SAR systems for in-home cognitive and mental health support of the target population is growing.
Furthermore, previous research has argued that static, smaller, portable, and friendly looking robots may be more trustworthy and acceptable to end users than bigger, mobile platforms, since the latter can be seen as a threatening obstacle to older users and imply more ethical and technical limitations, especially for use in private homes [72]. Importantly, for user acceptance and trust, the robot's appearance must be correlated with its functionality [68].
In relation to the design of simpler, affordable, yet expressive SAR, the implementation of an LCD screen for the robot's face has been argued to facilitate customization, adaptability to users' preferences, broader accessibility, and cultural sensibility of facial expressions [7], [73], [74]. Over 53% of the key SAR platforms reviewed employed LCD screens, either as part of the robot's physical appearance and expressiveness (e.g., [47] and [75]), or as an external source of human-robot input/output (e.g., [22] and [55]).
Additionally, the combination of digital features allows extended functionalities, such as user touch input, the display of reminders, teleconferencing, or visualization of the robot's speech, which has been argued to be an essential feature of conversational robots targeted at older adults in order to avoid misunderstandings [76]. Simpler affective robotic platforms are also more likely to avoid the uncanny valley effect, a hypothesized relationship between the degree of an object's resemblance to a human being and the emotional response to such an object [77]. The uncanny valley may also result from a mismatch between the user's expectations and the robot's components in any modality, including its physical appearance, which ultimately influences trust [78].
3) Roles of SAR for Mental Health Support: Of the four main roles of SAR to support mental health previously described (Section II-A), and considering that each SAR may combine multiple roles to assist end users, the vast majority of the papers included in this review targeted companionship (41%) and cognitive stimulation (35%) [see Fig. 4(b)]. Some of the robotic platforms reviewed were able to monitor the user's wellbeing and overall health (14%). Importantly, the role of assisting patients with cognitive impairment or dementia in clinical therapies was the least common (10%), targeted by only five robots (with published results of SAR interventions): 1) NAO [23]; 2) Eva [58]; 3) Sota [59]; 4) Bandit [19]; and 5) Ryan [56]. There is a clear gap in clinical translation of SAR platforms, particularly for dementia care.
More than half of the reviewed studies (60% or 18) involved small or limited sample sizes. Surprisingly, 15%+ of social robots with speech interfaces were trialled with 40+ participants, as shown in Fig. 5. However, none of these involved continuous HRI over long periods of time; instead, they encompassed a single session [79], one or two sessions per week over up to four months [59], [80], and occasional short interactions with robots placed in care facilities for up to two weeks [46], [81] (in these studies, interested users would come closer and greet the robot). Importantly, only two of the key studies with large sample size were conducted in a real-world care setting with the robot working autonomously (see [46], [81]).

5) Culture and Language Adaptation: Previous research provides evidence that acceptance of SAR is influenced by the perceived usefulness of and need for the technology, the user's expectations about its functionalities, previous experience with similar tools, as well as the user's culture [15], [87]. In fact, some evidence suggests that people from different cultures have different assumptions about robots [88] and that biases may be observed toward robots perceived as being from the same culture as the user [89]. This suggests that the design of affective SAR may need to be culturally tailored for robust and engaging HRI. This cultural dimension regarding older populations and PwD requires further language flexibility. However, cross-cultural testing of social robots and the development of culturally aware HRI are still rare (see [7], [90] for state-of-the-art examples in this new research arena).
Depending on the country in which the study took place, speech engines covering diverse languages have been used, including English [75], Spanish [55], Italian [23], German [50], Dutch [46], French [22], Greek [20], Polish [47], Swedish [20], and Japanese [83] (see further details in Table I). The vast majority of key papers included spoken interfaces without the possibility of multiple languages. To the best of our knowledge, only five robotic platforms were able to understand and reply in more than one language (e.g., [20], [23], and [75]). Nevertheless, the integration of off-the-shelf AI software for spoken interfaces using cloud-based solutions has recently become common [54], [57], [58], allowing robotic systems to understand and synthesize multiple languages or dialects. Conventional services include Google Speech to Text API, IBM Watson, Microsoft LUIS, and Amazon Lex, among others. Importantly, only three of the reviewed human-robot studies using conversational SAR platforms included some form of cultural adaptation or sensibility [46], [49], [80]; much remains to be done in this arena.

III. STATE OF THE ART
This section details comprehensively the key HRI studies that have designed, evaluated, or tested different SAR platforms with speech-based interfaces targeting older adults and individuals with cognitive impairment. Table I [...]
Q1: Has the robot followed a user-centered design; that is, has it been designed specifically to support the target population (older adults and PwD), or have end users been involved in the design process of the robot?
Q2: Does the robot include some form of personalization to user profiles or preferences?
Q3: Is the robot capable of two-way autonomous conversations?
Q4: Does the robot demonstrate a design tradeoff in its mechatronic components; that is, have cost and ease of use been considered in the design process? This would facilitate worldwide deployment and scalability. It should be noted that the development of simpler robotic platforms must not compromise the desired affective and functional abilities of SAR.

A. Mobile Social Robots
There is a body of work that has tested the acceptability and feasibility of mobile robotic platforms with embedded spoken language capabilities to promote independent living and wellbeing of seniors. One such system is GrowMu [46], able to navigate autonomously, manage users' daily routine by interfacing with caregivers and videocall, and retrieve user information over time to foster personalization. This robotic platform, aimed to be affordable, is around 1.3 m tall, with a touch screen on its chest, a head with facial expressions displayed with the eyes and mouth made of LED lights [Fig. 2(a)], a camera for real-time face detection, and speech interfaces. In a week-long pilot conducted at an aged care center, the robot was perceived as friendly and accessible by end users who briefly interacted with it. Issues with the robot's speech recognition were reported, as it was only capable of detecting a limited set of words (even so, the speech recognition accuracy was lower than 66%). Furthermore, the robot was limited to predefined text-to-speech (TTS) recordings to interact with users, was unable to maintain autonomous conversations, and its facial expressions were not aligned with verbal input/output.
A field trial in 18 older adults' homes over three weeks was carried out with Hobbit [Fig. 2(c)], a SAR with entertainment (e.g., brain training games) and safety functions (e.g., fall detection and prevention), which followed a user-centered design and low-cost approach [20]. Although the robot was found to be useful in assisting end users in their living environment, several of its functions lacked stability over time and its speech recognition engine did not work well for many users.
Another SAR pilot tested in a real-world scenario was SCITOS [81], with a static face made of eyes and a touch screen. The robot was placed in a care hospital for a five-day trial followed by five days of pilot testing to investigate the user experience perceived by older adults and staff. Findings showed that the perceived utility of the robot depends on what tasks it provides and its proper functioning. While an overall interest in the robot's functionalities was reported, staff were reluctant to share their workspace with a robotic agent. Furthermore, the robot's interactive and autonomous capabilities were limited; for instance, it lacked a speech recognition engine for meaningful two-way communication, which the authors identify as a fundamental feature to develop in future work.
The technical and communicative challenges of speech interaction between a SAR and ten older adults with dementia were examined in [63]. The robot ED, with body and head components, including an LCD monitor for audiovisual prompts or display of a simplistic digital face, was tested in a simulated home environment. Using a Wizard of Oz (WoZ) approach, the teleoperated robot guided participants to locations and provided instructions, using prerecorded prompts, to assist in daily activities, such as hand washing and tea making. The aspects of conversation and language used by PwD with the robot were quantitatively analyzed, as well as the efficacy of speech recognition in such a context. This study revealed that speech recognition remains a major open challenge for HRI with PwD. Conversation repair features are needed to overcome user confusion, lack of interest, and conversation breakdowns. To achieve effective speech-based assistance with [...], with a static face, touch display, and voice interface, integrated with smart home technology (e.g., presence sensor and light control), was tested in a simulated home environment for two consecutive days with 11 participants, including PwD and caregiver dyads [21]. The system provided cognitive and social support through appointment reminders, recommendation of activities, videocalls, and cognitive games; results from the latter are intended to be transmitted to a therapist's database and used to track cognitive health over time and adjust the therapy. Through qualitative semistructured interviews and behavior assessment, the robot was perceived as useful and enjoyable. Benefits of reduced caregiver burden were reported too. The robot's initiative (e.g., by navigating to meet users or giving reminders) was considered the most useful feature. However, the speech recognition engine had to be deactivated by the second trial day given its poor performance. Longer term effects on wellbeing beyond trials were not evaluated. Despite success as an engineering prototype, the robot did not leave the controlled scenario, was tested with a small sample size, and the research project has now finished.
Along similar lines, Caleb-Solly et al. [48] investigated usability and user experience issues with the Kompai robotic platform [Fig. 2(f)], in a two-day trial with a total of 11 older adults in different environments (controlled laboratory, care facility, and real homes). The authors argued that end users must adapt their behavior to the robot's feedback in order to facilitate HRI, in addition to the system's adaptability and personalization to user profiles. This may be reflected not only in the optimization of the robotic system but also in the user experience. A phased introduction to and learning of the robot may also enhance acceptance. The lack of robot autonomy and concerns regarding technical stability over time prevented long-term trials. Issues with voice interaction were further reported, especially with regard to poor speech recognition; synchronization errors occurred with the system's listening mode, as many users started to answer too early. One important design limitation of this robot is the lack of dynamic display of facial expressions. This mobile robotic platform with speech interfaces has recently been investigated to deliver personalized geriatric assessment and reminiscence therapy in dementia care, in addition to supporting caregivers in assessing patients' cognitive status (MARIO EU project, http://www.mario-project.eu/portal/) [92]. Yet, results of robot-assisted trials have not been published to date.
A usability and acceptability study with a larger sample size of 72 healthy older adults was conducted with Doro Robot-Era [79], a multimodal platform designed for older adults with a static face, LEDs on the eyes, a detachable touch screen tablet, and the ability to maintain a two-way verbal interaction autonomously, yet for simple commands and task-oriented dialogs only (e.g., food delivery service). Individual sessions with the robot were video recorded for behavior analysis, including user gaze and the total time spent looking at the robot or tablet as indirect measures of attention, in addition to postsession qualitative questionnaires. Despite the overall positive impression, participants experienced predefined interactive scenarios at the test set, instead of the robot's full functionality. Issues with the speech interface were reported. The authors claimed that multimodality is an added value to the robotic system and essential for increased acceptance and usage among end users, especially older people less experienced with technology.
The humanoid robot Bandit was used in cognitive therapies with 10 PwD for six months [19]. The robot was able to improve participants' engagement, attention, and performance in a music game via facial expressions, gestures, and prerecorded audio of a human voice, making use of an adaptive framework such that the difficulty of the game would adjust to the abilities of each participant. The robot has further been used as a coach to engage 33 older adults in simple physical exercises as a SAR-based therapeutic intervention [93]. Other key studies reviewed here, which used SAR mobile platforms with diverse functionalities, include Silbot [49], Pepper [22], robotic assistant for mild cognitive impairment patients at home (RAMCIP) [47], Robovie [80], and Pearl [61]. Yet, these were limited to small sample sizes or short-term exploratory trials, were conducted in controlled environments rather than in real-world ones, had limited autonomy or were fully teleoperated (WoZ), and in some cases did not involve actual interactions with the robot (see details in Table I). Furthermore, no evidence has been shown on the efficacy of these robots' speech-based interfaces for meaningful conversations with older adults and PwD.
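To make the idea of difficulty that adjusts to each participant more concrete, the following Python sketch shows one simple way such adaptation could work. It is not the framework used with Bandit in [19]; the streak rule, the window of three rounds, and the level bounds of 1 to 5 are all assumptions chosen for illustration.

```python
def update_difficulty(level, recent_results, min_level=1, max_level=5, window=3):
    """Return the next game difficulty level.

    recent_results is a list of booleans (True = the participant succeeded),
    oldest first. The level goes up after `window` consecutive successes,
    goes down after `window` consecutive failures, and otherwise stays put.
    """
    recent = recent_results[-window:]
    if len(recent) < window:
        # Not enough evidence yet: keep the current level.
        return level
    if all(recent):
        return min(level + 1, max_level)
    if not any(recent):
        return max(level - 1, min_level)
    return level
```

For instance, `update_difficulty(2, [True, True, True])` moves the participant up to level 3, whereas a mixed record such as `[True, False, True]` leaves the level unchanged, which avoids oscillating on a single lapse.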

B. Small Affective Social Robots
In the aforementioned studies, verbal communication was not the main function of social robots. More recently, however, there has been an increased body of work exploring cognitive stimulation and monitoring through conversation with older adults with and without dementia (see Table I).
A feasibility study was carried out using a conversational robot for cognitive assessment of 19 seniors with dementia in picture description dialogs [57]. Experiments with three conversation partners were cross compared: 1) human interlocutor; 2) robot Milo remotely controlled (WoZ approach) [Fig. 2(o)]; and 3) robot working in autonomous mode, equipped with APIs for automatic speech recognition and rule-based dialog management. Although the humanoid robot was capable of interacting through additional realistic facial expressions and gestures, these were not incorporated in robot-assisted tasks. The analysis of interactions involved a Kinect sensor for facial feature recognition, linguistic feature extraction from audio-recorded transcripts, qualitative questionnaires, and behavior observation. As a proof of concept, the authors demonstrated how linguistic analysis could be used in longitudinal monitoring and assessment of dementia; lexical features were automatically extracted. Whilst conversations were more engaging with a human interlocutor, participants showed an overall liking for verbal interactions with the robot. The human and the remotely controlled robot [scenarios 1) and 2), respectively] were able to respond to open-ended or flawed answers, sense when to finish the conversation, and interpret affect from facial expressions. This was not verified for the autonomous robot [scenario 3)], which was limited by inflexible dialogs and unable to personalize conversations or recover from breakdowns. The authors argued conversational robots may be more appropriate for individuals with milder cognitive impairment and highlighted the need for improvements in automatic speech recognition technologies to handle responses of older adults with mild cognitive impairment (MCI).
The commercially available social robot PaPeRo communicates with voice, touch interface, and gestures, shows facial expressions with LEDs on its mouth, and recognizes voices and emotions. A longitudinal study on engagement and acceptance by 115 PwD living in Australian care residences [51] and subsequent trials in five home settings [82] showed significant improvements in emotional, visual, and behavioral engagement. One relevant feature was the robot's capacity to automatically recognize affect from text input and adapt facial expressions and body movements. The robot could understand natural speech, but no specific details were given about its capacity for conducting autonomous, meaningful human-robot conversations. The authors highlighted the need to underpin user-centered design of SAR to suit individual preferences, changing needs, and health conditions of older adults and PwD. Another instance of a commercial social robot, NAO (footnote 3), has been used as a cognitive stimulation therapy tool for PwD with improvements in neuropsychiatric symptoms [28] and as a memory trainer for individuals with MCI showing increased attention and fewer depressive symptoms [23]. The robot has also been evaluated in 14 Dutch nursing homes by providing entertainment and stimulating physical activities [94].
Recent research explored how to adapt a robot's linguistic style based on explicit human feedback, ultimately targeting personalized responses over time [50]. In that work, ML is used for iterative learning based on a reward signal (i.e., a reinforcement learning approach); this included two robot personas for information retrieval activities or games, as well as eight possible politeness strategies for recommendations given by the robot; a set of scripted utterances was used for each task, and actions were triggered based on explicit user feedback. The autonomous robot Reeti [Fig. 2(i)] was perceived as attractive and easy to use in a preliminary study conducted in the homes of two participants for one week, where feedback regarding the robot's spoken style was given via physical buttons on a control panel (the only interface for human input). The system's requirement of additional hardware may limit its scalability. In addition, the robot lacked an NLU engine. Although adding one was considered a useful feature, privacy concerns were raised.
Another personalization approach is the one followed in [58], where conversational strategies were implemented to handle breakdowns, as well as interaction scripts tailored to user profiles [95]. The robot Eva can work in autonomous or teleoperated mode and integrates cloud-based AI features for NLU, basic speech synthesis, and digital emotions displayed with the eyes. One interesting feature is the personalized waiting time for user response, in that the system accounts for the average length of responses by each individual; this is particularly relevant when targeting older citizens with cognitive impairments and addresses a common drawback of commercial voice technologies (e.g., Amazon Alexa has a default response time of 8 s; see footnote 4). Eight PwD participated in robot-guided cognitive stimulation therapy group sessions over nine weeks. The analysis focused on evaluating the impact on participants' behavior beyond the duration of HRI sessions. A quantifiable measure to assess dementia-related behavioral symptoms was used before and after sessions, complemented with qualitative feedback from caregivers. Findings showed a significant decrease of three dementia-related symptoms (delusion, agitation, and exaltation) and positive short-term effects in mood were reported after robot-guided sessions. While tailored to user profiles, sessions followed a script and were often perceived as repetitive and monotonous. Although the system processed utterances to generate a verbal response and display an emotion, the ability to automatically adapt facial expression based on what the user said and how it was said was not reported in the study.
3 https://www.softbankrobotics.com/emea/en/nao
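To illustrate the personalized waiting time described above, the following is a minimal sketch rather than the implementation reported in [58]. It assumes the system logs how long each user has taken to respond in the past; the function name, the `margin` multiplier, and the `max_s` cap are hypothetical choices made only for this example.

```python
from statistics import mean

DEFAULT_TIMEOUT_S = 8.0  # fixed response window cited in the text for Amazon Alexa


def personalized_timeout(latencies_s, default_s=DEFAULT_TIMEOUT_S,
                         margin=1.5, max_s=30.0):
    """Return a response window (in seconds) scaled to one user's history.

    latencies_s: past durations (seconds) this user needed to respond.
    margin and max_s are hypothetical tuning values, not taken from [58].
    """
    if not latencies_s:
        # No history yet: fall back to the fixed default window.
        return default_s
    scaled = mean(latencies_s) * margin
    # Never wait less than the default, never more than the cap.
    return min(max(scaled, default_s), max_s)
```

Under these assumptions, a user who typically needs about 12 s would be given an 18 s window, whereas a user with short responses would keep the 8 s default.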
Recent research has presented a preliminary study with the social robot Mini, specifically designed for in-home support of older populations [55]. With a cartoon-like appearance, affective markers, speech-based interfaces, a touch screen, and a knowledge base to store individuals' information and customize behavior, this desktop robot provides companionship and cognitive stimulation exercises. Its dialog modeling system considers two important variables to manage the flow of conversation and handle errors: 1) initiative and 2) intention. The robot was perceived by 20 end users as a useful tool to motivate toward daily activities; however, its ability to extend user independence was not recognized.
On the other hand, a small, desktop social robot aimed at conducting clinical screening and wellbeing assessment based on verbal communication was tested in single 50-min sessions with 30 healthy older adults [54]. The robot was equipped with a touch screen, face detection, speech-based interfaces, as well as automatic evaluation of participants' answers for wellbeing reports. This study reported an overall positive impression and high trust in using the social robot for wellbeing assessment, stressing its potential as a home-based screening tool for people at risk of developing dementia. Yet, it was limited to task-oriented (question-answer based) and predefined dialogs, not personalized to each user's profile or cognitive performance over time. Additionally, errors in speech recognition during spelling tasks were reported. The platform has further been explored as a medication adherence system using a mobile app and the Cloud [96].
The aforementioned studies indicate that upholding a conversation with older adults and PwD is a difficult research task, particularly because speech recognition often fails in human-robot dialogs. Aiming to address this challenge, recent research [86] has proposed a model to prevent disruption of dialog when speech recognition fails with older citizens in Japan, using a twin-robot dialog system. The robotic system takes initiative asking various questions in three topics of conversation in a coherent way, even when speech recognition is not precise. Furthermore, the teleoperated robot Telenoid has been trialed with five PwD in a care facility for ten weeks, as a tool to promote conversation and improve behavioral and psychological symptoms of dementia [97]. Magyar et al. [98] have further proposed an autonomous dialog system integrated with Telenoid that triggers the next spoken action (including conversation topic) by estimating the senior's emotion and motivation based on nonverbal cues, through the use of external sensors to extract emotional features (e.g., facial emotion). Using a reinforcement learning approach, the adaptive robotic system would trigger one of three actions: 1) short response (simple agreement, encouragement); 2) long response (question); or 3) topic change (a statement introducing a new topic), being able to maintain interactions with end users for 20 min.
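The action selection described above can be sketched with a deliberately simplified tabular, epsilon-greedy learner. This is not the method of [98]; the state labels, the reward signal, and the `epsilon` and `alpha` values are assumptions for illustration only. The three actions mirror those listed in the text.

```python
import random
from collections import defaultdict

# The three spoken actions named in the text.
ACTIONS = ("short_response", "long_response", "topic_change")


class DialogPolicy:
    """Hypothetical epsilon-greedy policy over estimated user states."""

    def __init__(self, epsilon=0.1, alpha=0.5, seed=None):
        self.epsilon = epsilon        # probability of exploring a random action
        self.alpha = alpha            # learning rate for value updates
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def choose(self, state):
        """Pick the next spoken action for an estimated user state."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        # Greedy choice; ties resolve to the first action in ACTIONS.
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward):
        """Move the value estimate toward the observed reward."""
        key = (state, action)
        self.q[key] += self.alpha * (reward - self.q[key])
```

In such a sketch, `state` would come from the emotion and motivation estimates (e.g., a label such as "low_motivation"), and `reward` from some measure of continued engagement; both are placeholders here.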

IV. VOICE ASSISTANTS AND SMART SPEAKER TECHNOLOGY: HORIZON SCANNING
Although AI voice-based technologies were not directly a part of our inclusion criteria (Section II-B), there has been emerging interest in applying conversational agents, voice assistants, and smart home speakers in healthcare, including to assist older adults and PwD in the living environment. While this field remains in its infancy, we summarize a range of the current efforts in the context of this review. Voice systems have yet to produce research results for cognitive and mental health support of target populations, particularly with clinical utility. This is often pronounced given the convergence, and in some cases divergence, of commercial and academic research with comparable targets. Due to recent worldwide commercial viability and adoption, however, a great number of studies in this arena is expected in the near future. We believe future efforts should target integration of both affective SAR platforms and state-of-the-art conversational AI systems to: 1) maximize user engagement and 2) enable utility for ageing and dementia in home and clinical settings. Therefore, we present a complementary horizon scanning of AI voice technology for cognitive assessment and dementia support. We propose it as an area for deeper survey in the future since literature to date lacks real-world trials with target users.

A. Review Protocol
A search for studies published between 2015 and 2021 was conducted using scientific publication databases, such as IEEE Xplore, ACM, PubMed, and Google Scholar. The databases were searched using keywords related to: "conversational agent," "voice assistant," "speech interface," "smart speaker technology," "elderly," and "cognitive impairment." Inclusion criteria consist of studies that have used a conversational agent embedded in a mobile app (i.e., voice assistant) or smart home speakers (e.g., Amazon Echo and Google Home) to interact with older populations and individuals with cognitive impairment in domestic or clinical environments. Furthermore, we searched for studies that used speech and linguistic markers for assessment of cognitive decline or dementia.

B. Conversational Agents in Healthcare
Conversational agents are AI-powered systems that mimic human conversations by understanding natural language and generating relevant responses in the form of text, voice, or both [99]. There appears to be a lack of consensus regarding definitions of conversational agents, dialog systems, embodied conversational agents, chatbots, and smart conversational interfaces [14], [100]. Familiar examples include prominent voice assistants that have entered the market integrated in mobile platforms or smart speakers, such as Amazon's Alexa, Google Assistant, Microsoft's Cortana, or Apple's Siri. In light of their expanding AI capabilities and the increased access to users' contextual information coming from external sensors in smart home environments, the use of conversational agents has recently become more prevalent in healthcare applications. The literature suggests conversational agents may be effective in delivering cognitive behavioral therapy [101]. Yet, evidence of efficacy and safety is still limited [14]. Additionally, there is a lack of studies on fully deployed voice assistants in the healthcare domain, with fewer examples beyond research contexts. Most studies reviewed in 2018 [14] were limited to task-oriented conversational agents to support patients and clinicians in highly specific scenarios (e.g., information retrieval and predefined clinical interview), restricting user input to predetermined utterances. The authors highlighted that the use of such technology in clinical trials needs to be carefully monitored, as more complex dialog systems and higher conversational flexibility come with a higher risk of errors in the NLU, response generation, or user interpretation. In fact, evidence suggests conversational agents are not yet mature enough to reliably support healthcare; even when user statements explicitly contain risk or harm (e.g., "I want to commit suicide" and "I am depressed"), inconsistent responses have been reported [102].
The COVID-19 outbreak has spurred greater interest in the use of voice interfaces as a tool for remote healthcare delivery and support of high-risk populations, such as seniors and people living with dementia [71]. In addition to providing up-to-date, relevant COVID-19 information, voice assistants hold strong potential to support patients in need of routine care, such as health screening via conversations. Patient voice data could be further used as a biomarker for continuous monitoring of mental health. However, deploying voice assistants in such real-world scenarios remains challenging due to the following limitations: speech recognition errors; the need for a persistent Internet connection; user and organization compliance in exchanging personal health information; and erroneous or misleading information being provided, which was a risk faced and mitigated during the COVID-19 pandemic [103]. Overall, further investigation is needed to test benefits of conversational agents in clinical contexts to ensure transparency and patient safety.

C. Speech Interfaces for Ageing and Dementia Care
Upholding a conversation with PwD involves many breakdowns in communication. Hence, prior research has proposed design guidelines to tailor voice-based interfaces for effective use by older adults with cognitive decline [104].
These systems must be able to: 1) handle user pauses and hesitations, especially in open-ended questions; 2) accept preemptive responses (e.g., when participants interrupt and start responding before the system finishes the sentence); 3) include instructions on how to recover the conversation when errors occur; and 4) assist with confirmation of responses by combining voice with visually displayed messages on a screen. Intelligent conversational systems should also be able to identify the emotional state of the user and convey emotion in speech. While the literature on emerging ML techniques for affect recognition from speech and text is vast, including the use of cloud-based solutions for sentiment analysis [105] and cognitive assessment via automatic spoken language processing [39], the ability of a conversational agent to convey emotion in speech remains a major research challenge. Therefore, embodied SAR systems with multimodal affective cues (Section III) may be more effective in conveying emotions.
Recent literature has explored the benefits, limitations, and open research challenges of using voice interfaces with older adults and cognitively impaired individuals. A qualitative study based on three focus group discussions with healthy older people, PwD, and caregivers, using a simulated tablet-based assistant to help users navigate the calendar, stressed the importance of adapting interaction style to meet the needs, preferences, and cognitive decline of each user [106]. A premise warranting further investigation is that some end users questioned the acceptability of a voice system without a face. In [107], a prototype application based on Amazon's Alexa has been proposed to provide audio prompts for routine tasks to people diagnosed with dementia. In [108], ML algorithms have been applied to identify dialog-related confusion from the speech of individuals with Alzheimer's disease; accuracies above 80% were obtained and learned policies were implemented to avoid conversation breakdowns. Several linguistic features were extracted as verbal indicators of confusion (e.g., vocabulary richness, parse tree structures, and acoustic cues).
Understanding how smart speaker technologies are used in the home has received a great deal of attention in recent years. Daily patterns of conversational Alexa data usage over time have been explored [109]; qualitative studies have also been conducted to understand people's experiences with voice-enabled devices and why interest is oftentimes lost after the novelty effect [110]. In [111], several Alexa Skills have been implemented to assist eldercare at home, including medication alerts, a diet tracking system, and fall alerts sent to the caregiver of older adults. Along similar lines, Tan et al. [62] have proposed a smart home system composed of several Alexa Skills targeted at assisting older people and addressing the needs of Alzheimer's patients and caregivers, including depression screening, medication setup, and dressing assistance. Other Alexa Skills have been developed for remote care of older relatives [112] and support of early stages of dementia [113], yet no results on user compliance or longitudinal analysis of Alexa interactions have been published. Furthermore, prior research reports a Google Home app to assist older adults with self-management of type 2 diabetes [114].
Despite increasing interest in this arena, research to date lacks longitudinal studies of frequent interactions with voice-enabled technology in the home environment. Particularly, the extent to which voice assistants can monitor health and wellbeing of seniors and PwD, as well as the consequent long-term effects on mental health and cognitive decline, remains largely unexplored. Moreover, little is known about how older adults perceive the benefits of this type of voice-based interaction. From a practical standpoint, commercial smart home speakers present additional challenges for effective verbal interaction with cognitively impaired individuals. Alexa, for instance, does not handle user hesitation or pauses when speaking. When testing Alexa Skills, after Alexa stops speaking there is only an eight-second window for the user to respond before a reprompt or end of session (see footnote 5). When the user pauses during a response, Alexa stops listening and applies NLU to the information given so far, which may generate misleading responses. This may cause user confusion and frustration, therefore limiting acceptance and long-term use of such smart home devices. Ongoing research is targeting the development of more natural, fluid dialog interfaces and exploring interaction patterns to diagnose dementia [115].

D. Voice Technology as Bridge to Cognitive Assessment
Voice interfaces have been investigated as a potential way to detect cognitive decline and early signs of dementia from linguistic and speech patterns [37], [116], [117]. Previous research has explored the use of conversational agents with individuals with MCI and dementia; pause and utterance duration, pitch, frequency of head nods, and overall patient responsiveness in the conversation were used as indicators of cognitive status [118]-[120]. Particularly, simple linguistic markers, such as word choice, phrasing, and short speech patterns have shown predictive power in assessing MCI status in older populations [121]. Evidence suggests that diagnostic markers can be automatically derived by applying NLP and speech processing techniques to neuropsychological examination samples for further discrimination between healthy older adults and those with MCI [116]. Along similar lines, the authors of [122] implemented verbal fluency tests (standard cognitive tests clinically used to assess dementia) in a conversational agent and described the ML analysis to automatically extract features from speech and language in order to successfully differentiate between healthy controls and individuals with MCI. Furthermore, researchers are actively investigating ways to assess cognitive impairment and dementia using mobile applications [123], [124] as well as conversational SAR acting as a psychologist [125]. Overall, analysis of interactions with voice technology can form a strong baseline for longitudinal health monitoring and for assessment of both age-related cognitive decline and dementia progression over time. Future voice assistants and smart speaker technology may incorporate specific algorithms to identify linguistic markers with predictive power in assessing cognitive decline. This could expand their utility for supporting target populations. Further studies in this promising arena, including clinical trials, are warranted.
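As a minimal sketch of how two of the markers mentioned in this section might be computed, the example below derives a type-token ratio (one common proxy for vocabulary richness) from a transcript and pause durations from utterance timestamps. The function names, the simple tokenization, and the input format are assumptions for illustration; the cited studies use richer feature sets and validated pipelines.

```python
import re


def type_token_ratio(transcript):
    """Vocabulary richness proxy: distinct words divided by total words."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def pause_durations(utterances):
    """Silent gaps (seconds) between consecutive utterances.

    utterances: list of (start_s, end_s) tuples sorted by start time,
    a hypothetical format such as a speech recognizer might provide.
    """
    return [max(0.0, nxt[0] - cur[1])
            for cur, nxt in zip(utterances, utterances[1:])]
```

Tracked per session, values such as these could feed the longitudinal comparisons discussed above, although any clinical interpretation would require validation against established assessments.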

V. DISCUSSION AND FUTURE DIRECTIONS
Conversational agents have advanced significantly in recent years; however, few investigations have demonstrated direct impact in support of ageing or dementia. From use and clinical perspectives, there is a very strong need for systems that can interact with target users over time to: 1) store regular health information and adjust interactions based on changing health conditions; 2) provide intervention/support in situations of stress to positively influence wellbeing (e.g., acting on agitation or initiating a conversation in the event of confusion); and 3) offer tangible support in targeted activities of daily living to promote independence and relieve caregiver burden. We suggest the following areas in need of further studies in the field of conversational social robots and broader AI voice technology for mental health and dementia care (Fig. 6).
1) User Adherence: Long-term compliance regarding sustained use by target populations must be addressed. Studies drawing from user-centered design processes focused on fulfilment from the stakeholder perspective (i.e., positive engagement in use) and deployment in real homes hold promise to begin to address this gap.
2) Clinical Utility: The ability to gather meaningful data of use to clinicians is not established. The nature of the information that gives actionable insights is poorly understood, as are the means of collecting such data. Further studies are necessary on how such data can be gathered out-of-clinic, with tight feedback loops from medical and social care professionals to assess its utility. We suggest this is likely to be an iterative process; hence, small-scale deployment in specific clinically relevant areas of support offers a basis for broader efforts. Furthermore, we believe ethical issues around data gathering and modality (e.g., voice data are personally identifiable) must be addressed in the earliest stages of the development cycle, with close nontechnical feedback directing design to adjust to privacy and data protection concerns.
3) Intelligent Adaptation: Drawing from user engagement and medical utility, new research is necessary to establish the capacity to intelligently adapt to individual needs, preferences, and cognitive abilities. Longitudinal deployment can generate databases on specific needs, interests, and preferences of users, which should be used to tailor future automated conversations. Tools, such as speech analysis and NLP, offer a basis for making inferences on user state during conversation; however, ground truth is very difficult for comparison. In a similar manner, integration with sensors in smart environments (e.g., motion sensors and wearables) can also allow adaptation based on nonverbal indicators of physical or mental state. Algorithms tracking changes in language or speech over time, however, may give insight into mental health. This supports long-term personalized mental support and cognitive engagement.
These challenges are common to both types of voice interactive technologies here addressed: SAR platforms able to communicate through natural language, often integrated with additional affective modalities (e.g., facial expressions and gestures), and pure voice-based technologies. We strongly believe these intelligent systems must be inherently coupled for enhanced user engagement and patient benefit; therefore, the challenges identified are discussed concurrently to foster future directions and research efforts in a very multidisciplinary arena. Finally, while automated interventions have the potential to improve health, they can easily be perceived as complex, controlling, denigrating, or simply unnecessary. If forced onto users, they may have a negative effect on their psychological health and wellbeing, which has hampered many efforts to date.

A. Effectiveness of Natural Language Interaction for Target Users
The development of natural and engaging verbal interactions to support older people with and without dementia remains a very challenging research task. First, voice interfaces evaluated to date lack stability and effectiveness over time; as seen in Table I, 40%+ of studies reported limitations of HRI due to lack of effectiveness and naturalness in the integrated voice system, including errors in speech recognition, speech synthesis, or both (e.g., [20], [23], [54], [57], and [86]). Other studies either did not specify the effectiveness and autonomous ability of the robot's speech interfaces (e.g., [52], [59], [75], [80], and [85]), or used very simple, predefined recordings in repetitive, script-based HRI (e.g., [46], [58], and [79]). Challenges in developing practical speech recognition engines for older people at home are well documented [20]. SAR platforms and conversational AI systems lack the autonomy to handle end users' hesitation, confusion, and overall conversation breakdowns, which are reasonably common in the target population. Very few studies in Table I attempted to implement conversational strategies to handle dialogs with cognitively impaired individuals in an effective manner (see [55], [57], [58]). Commercially available voice systems, on the other hand, give the user limited time to think before responding, which largely hinders appropriate and regular usage by people with cognitive impairments or dementia. Hence, future design of conversational social robots for target users should encompass mitigating: 1) speech recognition errors; 2) user hesitation or frustration; and 3) repetitive dialogs that may lead to user disengagement. For this, the algorithmic integration of context awareness and effective conversation fallback strategies is fundamental.
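One way to picture a conversation fallback strategy covering the three mitigation targets above is the rule-based sketch below. It is a hypothetical illustration, not a strategy reported in any of the reviewed studies: the move names, the `min_confidence` threshold, and the limit of two failed turns are all assumptions. The on-screen confirmation follows the design guideline of combining voice with displayed messages (Section IV-C).

```python
def choose_fallback(heard_text, asr_confidence, failures, min_confidence=0.6):
    """Pick the next dialog move after one user turn.

    heard_text: recognized text ("" if the user stayed silent).
    asr_confidence: recognizer confidence in [0, 1] (assumed to be available).
    failures: number of consecutive turns that have already failed.
    """
    if heard_text and asr_confidence >= min_confidence:
        return "proceed"            # input understood: continue the dialog
    if failures >= 2:
        return "change_topic"       # avoid looping on the same prompt
    if not heard_text:
        # Silence is treated as hesitation: wait longer first, then rephrase.
        return "extend_wait" if failures == 0 else "rephrase_prompt"
    # Speech was heard but poorly recognized: confirm by voice and on screen.
    return "confirm_on_screen"
```

A dialog manager would call this after every user turn and reset `failures` to zero whenever it returns "proceed".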

B. Call for Adaptive Frameworks
Future work may target the implementation of flexible NLP and more advanced dialog management systems to achieve longer, more natural, and engaging HRI. Many of the studies surveyed do not allow mixed-initiative verbal interaction (i.e., when both the user and the robot can lead the conversation) and therefore do not overcome the command-only barrier, a well-acknowledged limitation of several social robots and voice systems [126]. In fact, various conversational SAR here reviewed were limited to conversation templates and simplistic spoken prompts, were often restricted to system-led, repetitive dialogs in highly controlled environments (i.e., less dynamic and noisy), or in some cases were remotely operated. In addition, ML frameworks for adapting HRI to user profiles, preferences, and needs over time remain underexplored. The robot's linguistic style, persona, and voice intonation should be personalized to individual profiles and cognitive status to guarantee adherence to the technology beyond the novelty phase. A knowledge base with training data may be implemented and automatically updated over time to achieve this. Surprisingly, 30%+ of the key robotic platforms comprehensively reviewed (Section III) included some form of personalization or adaptive behavior; this included adapting the difficulty of a cognitive game to the abilities of each participant [19], personalizing music [51], adapting a robot's linguistic style based on explicit human feedback [50], personalizing assistive activities based on the level of user engagement in a task [85], incorporating a database with user profiles to retrieve user information [46] and, to some extent, tailoring conversations accordingly [55], [58]. The design of conversational SAR and voice assistants calls for further cross-cultural and language adaptation (including the ability to process and synthesize multiple languages) to facilitate worldwide deployment.
Taken together, the development of adaptive frameworks for HRI would be key to address the compliance issue.

C. Multimodal Affective HRI
Another open challenge identified is the ability of a robot to combine its multimodal affective cues, both verbal and nonverbal (i.e., implicit communication through facial expressions, gestures, or body language). This could contribute toward more trustworthy, engaging, and empathic interactions from the user perspective. Ultimately, the robot would sense the user's mood/behavior and adapt its response suitably, in real time. Researchers are actively exploring multimodal SAR systems, including gesture, gaze, and touch-based HRI for supporting older adults and PwD [127]-[130]. Recent progress has particularly been made to integrate virtual and embodied robots with meaningful facial expressions for enhanced engagement in HRI [7], [74]. Yet, further user-centered studies are needed to understand how additional affective modalities may contribute to, or in some cases jeopardize, dementia care at home and in clinical environments. Exemplary SAR platforms of the past two decades further demonstrate research efforts to combine affective markers; of the 30 key robots reviewed in Sections II and III, 60% (or 18 platforms) include the ability to display robotic emotion through digital or physical facial expressions, over 45% (or 14 platforms) are capable of gestures/body movements, and around 27% (or 8 platforms) include both facial expressions and gestures. Yet these have rarely been integrated with the robot's verbal communication. Note that a few robotic platforms do not clearly denote the ability to show facial expressions or gestures and were therefore not counted above.

D. Benchmarks for Robot Acceptance and Usefulness in Healthcare
Robotics research lacks clear benchmarks to measure the usefulness of SAR in healthcare contexts [131], in particular for dementia care. Important questions need to be addressed, such as: what real effect do robotic agents have on long-term quality of life, cognitive abilities, and user overall wellbeing? How to address the compliance issue? How well do conversational robots detect subtleties of language, tone, and context that may signal a risk for patient harm? How well do they integrate with other home sensors or devices to trigger further action on patient safety? How are data from voice-based interactions managed to ensure privacy and the development of ethically safe robots? Furthermore, a specific, validated, and objective model of robot acceptance is needed.

E. Longitudinal, Realistic, Randomized Trials
Long-term, continuous human-robot data in realistic environments is clearly lacking in the literature, in particular for clinical applications. For the four SAR platforms tested with end users over a period of more than three months [19], [57], [80], [82], HRI sessions were often limited to one per week or a small number per year (see Table I), or did not involve deployment in the wild (e.g., homes); only five HRI studies were both conducted over multiple weeks and included at least a medium sample size (i.e., 20+ participants) [23], [53], [59], [80], [83]. Indeed, generalizability or reproducibility of past HRI studies may often be compromised by sample size (Fig. 5). Overall, we recommend future efforts concentrate on continuous HRI over long periods of time, especially in user homes and clinical settings. One major obstacle for longitudinal trials, however, lies in obtaining ethics approval for trials with vulnerable populations. Privacy concerns and information governance for handling sensitive patient data (e.g., voice, speech content, and facial expressions) need to be carefully addressed. We further recommend longitudinal trials include an acclimatization period through repeated HRI in order to: 1) enhance engagement and 2) minimize user concerns related to data privacy or emotional deception, which commonly occur when expectations of the robot are not met.

F. Translation Into Clinical Applications
Clinical translation of conversational robotic technology remains immature. Only 5 of the 30 conversational robots reviewed aimed to support users through cognitive stimulation, therapy, or clinical screening (see details in Table II). The mobile robot MARIO (Kompai platform) [92] has also been proposed as a potential tool to deliver reminiscence therapy to older adults and PwD, yet results from field tests with target users have not been published. Robotic assistive technology able to provide in-home cognitive support or clinical therapy for those with degenerative brain diseases has received little empirical study to date. To meet the future clinical needs in dementia care, much more remains to be done, as these therapies still have inadequate ecological validity and oftentimes unproven outcomes.
We believe the current dearth of clinically useful and appropriate robots for ageing and dementia is particularly pronounced because existing research platforms are not commercially available for wider adoption. In addition, it is still unclear what type of robotic platform and which affective communicative modalities are preferred for clinical applications with dementia patients (e.g., purely voice, versus combined affective face and voice, versus embodied robot, versus digital robot). While we expect further studies in this arena in the near future, we strongly believe multidisciplinary collaborations between roboticists, clinicians, and target users are key to overcoming this important gap. Furthermore, we recommend HRI studies aimed at supporting mental health and dementia target a simple application where a difference in a specific aspect of the user's life routine (social, physical, or psychological) is guaranteed, and take incremental steps from there. Importantly, end-user feedback should be considered at all iterative design stages of conversational robots, which could be achieved through patient and public involvement (PPI) with various focus group discussions.

VI. CONCLUSION
Worldwide, the ageing population has caused a marked increase in the number of people with cognitive decline linked to dementia. Preserving cognition and mental health, including cognitive stimulation through verbal communication, is critical to ageing with autonomy, independence, and wellbeing in the home environment. Conversational affective social robots and AI voice technology (e.g., ubiquitous smart home speakers) hold significant promise to assist older people and those living with dementia in social and clinical contexts. Yet, long-term user compliance, effectiveness in the home environment, and translation into clinically useful applications call for further investigation. The vast majority of key human-robot studies reviewed here either described pilot studies with the robot placed in aged care residences for short periods of time, working with limited functionality and autonomy, or were conducted in highly controlled settings (e.g., laboratories and simulated home environments) instead of real-world assistive environments (e.g., homes). There is an apparent gap in the longitudinal and effective use of SAR with natural language capabilities for ageing and dementia support in realistic settings. Furthermore, there are few randomized clinical trials and a lack of benchmarks to examine the efficacy and utility of conversational systems, particularly for dementia care.
This comprehensive review highlights that, although research using conversational robots has increased over the years, effective speech interfaces for interactions with older adults and cognitively impaired individuals remain underexplored. We recommend future efforts address the following open challenges: 1) unconstrained NLP and conversational strategies with adaptive frameworks tailored to user needs, preferences, cognitive abilities over time, and potentially culture are needed for enhanced engagement in HRI; 2) as we move toward conversational affective social robots, we need more robust models to achieve meaningful two-way conversations with target populations, including the ability to handle user hesitations and recover from conversation breakdowns; 3) conversational robotic systems must combine additional affective modalities in order to enhance user engagement and trust; and 4) clinical trials with validated models, clear data protection, and healthcare benchmarks are needed to properly translate conversational robots and voice assistants for ageing and dementia care.
Advances in ML, and particularly in conversational AI, will be a major driving force in the development of truly effective and personalized SAR systems with autonomous natural language capabilities. However, the largest challenges in actual deployment for patient utility lie in user adherence and the assurance of data privacy. These must be addressed in parallel with advances in machine intelligence for tangible user benefit. Overall, conversational affective social robots, voice assistants, and smart speaker technology hold strong potential to promote independence, companionship, health monitoring, and cognitive stimulation of older adults and people living with dementia, which could ultimately be translated into deployable robotic therapeutic and telemedicine solutions.