I. Introduction
Our research introduces testFAILS, a robust framework for the rigorous evaluation and comparison of leading chatbots, including OpenAI's ChatGPT-4 [1], Google's Bard [2], Meta's LLaMA [3], Microsoft's Bing Chat [4], and emerging contenders such as Elon Musk's TruthGPT [5]. TestFAILS adopts an adversarial stance, spotlighting chatbot deficiencies to counterbalance the frequent media hype surrounding "AI breakthroughs." Despite rapid advances in AI, no chatbot has yet passed the Turing Test, underscoring the pressing need for effective evaluation methodologies.

TestFAILS comprises six components: Simulated Turing Test Performance, User Productivity and Satisfaction, Integration into Computer Science Education, Multilingual Text Generation, Pair Programming Capabilities, and Success in Bot-Based App Development. We implement a ternary evaluation system that assigns each component a pass, fail, or undetermined status based on counterexamples, together with numerical indicators that allow scores to be compared and aggregated. To fine-tune the component weights, we asked the chatbots themselves to rate the importance of each component. Their responses, shown in Table 1, revealed consensus on the first two components and variation on the others.
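To illustrate how ternary statuses and component weights might be combined into a single comparable score, the sketch below shows one possible aggregation. The numeric mapping (pass = 1.0, undetermined = 0.5, fail = 0.0), the example weights, and the function names are assumptions for illustration only, not the exact values or procedure used in testFAILS.

```python
from typing import Dict

# Hypothetical numeric mapping for the ternary statuses; the framework does not
# prescribe these exact indicator values, so they are illustrative assumptions.
STATUS_VALUE = {"pass": 1.0, "undetermined": 0.5, "fail": 0.0}

def aggregate_score(statuses: Dict[str, str], weights: Dict[str, float]) -> float:
    """Combine per-component ternary statuses into one weighted score in [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(STATUS_VALUE[statuses[c]] * w for c, w in weights.items())
    return weighted / total_weight

# Example: the six testFAILS components with assumed weights
# (e.g., informed by the importance ratings summarized in Table 1).
weights = {
    "turing_test": 0.25,
    "user_productivity": 0.25,
    "cs_education": 0.15,
    "multilingual_generation": 0.15,
    "pair_programming": 0.10,
    "app_development": 0.10,
}
statuses = {
    "turing_test": "fail",
    "user_productivity": "pass",
    "cs_education": "pass",
    "multilingual_generation": "undetermined",
    "pair_programming": "pass",
    "app_development": "fail",
}

print(f"Aggregate score: {aggregate_score(statuses, weights):.2f}")
```

A weighted average of this kind keeps each chatbot's result on a common 0-to-1 scale, so scores remain comparable even if the per-component weights are later revised.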