Abstract:
Due to the migration megatrend, efficient and effective second-language acquisition is vital. One proposed solution involves AI-enabled conversational agents for person-centered interactive language practice. We present results from ongoing action research targeting quality assurance of proprietary generative dialog models trained for virtual job interviews. The action team elicited a set of 38 requirements, and we designed automated test cases for the 15 of particular interest to the evolving solution. Our results show that six of the test case designs can detect meaningful differences between candidate models. While quality assurance of natural language processing applications is complex, we provide initial steps toward an automated framework for machine learning model selection in the context of an evolving conversational agent. Future work will focus on model selection in an MLOps setting.
Published in: 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN)
Date of Conference: 16-17 May 2022
Date Added to IEEE Xplore: 17 June 2022