Abstract:
Recent years have witnessed tremendous advancements in AI tools (e.g., ChatGPT, GPT-4, and Bard), driven by the growing power, reasoning ability, and efficiency of Large Language Models (LLMs). LLMs have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. Despite their proficiency in general queries, specialized tasks such as metaphor understanding and fake news detection often require finely tuned models, which makes a fair comparison with specialized Deep Learning (DL) models challenging. We propose an assessment framework to compare task-specific intelligence with general-purpose LLMs on suicide and depression tendency identification. For this purpose, we trained two DL models on a suicide and depression detection dataset and measured their performance on a held-out test set. The same test set was then used to evaluate four LLMs (GPT-3.5, GPT-4, Google Bard, and MS Bing) using four classification metrics. The BERT-based DL model performed best overall, with a testing accuracy of 94.61%, while GPT-4 was the runner-up with an accuracy of 92.5%. The results demonstrate that the LLMs do not outperform the specialized DL models but achieve comparable performance, making them a decent option for downstream tasks without specialized training. However, the LLMs outperformed the specialized models on the reduced dataset.
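For concreteness, the sketch below shows the kind of metric-based comparison the abstract describes: scoring each model's predictions against the same test labels. The abstract does not name the four classification metrics, so accuracy, precision, recall, and F1 are assumed here, and all labels and predictions shown are hypothetical placeholders.

```python
# Minimal sketch of the evaluation protocol described in the abstract.
# Assumptions (not specified in the abstract): the four metrics are
# accuracy, precision, recall, and F1; labels are binary (1 = at-risk).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Return the four classification metrics for one model's predictions."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }

# Hypothetical usage: the same test set is scored for every model,
# whether a fine-tuned DL classifier or an LLM prompted as a classifier.
y_true   = [1, 0, 1, 1, 0]   # gold labels from the test split
bert_out = [1, 0, 1, 0, 0]   # predictions from the fine-tuned BERT model
gpt4_out = [1, 0, 1, 1, 1]   # predictions parsed from GPT-4 responses
for name, preds in [("BERT", bert_out), ("GPT-4", gpt4_out)]:
    print(name, evaluate(y_true, preds))
```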
Published in: IEEE Transactions on Big Data (Early Access)