Use me wisely: AI-driven assessment for LLM prompting skills development

Bibliographic Details
Main Authors: Dimitri Ognibene, Gregor Donabauer, Emily Theophilou, Cansu Koyuturk, Mona Yavari, Sathya Bursic, Alessia Telari, Alessia Testa, Raffaele Boiano, Davide Taibi, Davinia Hernandez-Leo, Udo Kruschwitz and Martin Ruskov
Format: Article
Language: English
Published: International Forum of Educational Technology & Society 2025-07-01
Series: Educational Technology & Society
Subjects:
Online Access:https://www.j-ets.net/collection/published-issues/28_3#h.8qxfv1d3o98l
Description
Summary: Prompting with large language model (LLM) powered chatbots, such as ChatGPT, is being adopted in a variety of tasks and processes across different domains. Given the intrinsic complexity of LLMs, effective prompting is not as straightforward as one might anticipate, which highlights the need for novel educational and support methods that are both widely accessible and seamlessly integrated into task workflows. However, LLM prompting depends strongly on the specific task and domain, reducing the usefulness of generic methods. We investigate whether LLM-based methods can support learning assessments using ad-hoc guidelines and an extremely limited number of annotated prompt samples. In our framework, guidelines are transformed into features to be detected in learners’ prompts. The descriptions of these features, together with annotated sample prompts, are used to create few-shot learning detectors. We compare various configurations of these few-shot detectors, testing three state-of-the-art LLMs and derived ensemble models. Our experiments are performed using cross-validation on the original sample prompts and on a specifically collected test set of prompts from task-naive learners. We find that the choice of LLM has a strong impact across our feature list. One of the most recent models, GPT-4, shows promising performance on most of the features. However, closely related models (GPT-3, GPT-3.5 Turbo (Instruct)) behave differently when classifying features. We highlight the need for further research in light of the possible impact of design choices on the selection of features and detection prompts. Our findings are of relevance for researchers and practitioners in generative AI literacy, as well as researchers in computer-supported learning assessment.
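To make the described framework concrete, the sketch below illustrates one plausible shape of such a few-shot detector. It is not the authors' code: the function names, the YES/NO answer format, and the injected call_llm helper are all assumptions. It shows how a guideline feature description plus a handful of annotated sample prompts can be packed into a single classification prompt, and how a simple majority vote over several LLM backends could serve as a derived ensemble model.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    prompt: str        # a learner's prompt
    has_feature: bool  # annotation: does the prompt exhibit the target feature?

def build_detection_prompt(feature_desc: str, examples: List[Example], query: str) -> str:
    """Assemble one few-shot classification prompt for a single guideline feature."""
    lines = [
        f"Feature to detect: {feature_desc}",
        "Answer YES if the learner's prompt shows this feature, otherwise answer NO.",
        "",
    ]
    for ex in examples:
        lines += [f"Prompt: {ex.prompt}", f"Answer: {'YES' if ex.has_feature else 'NO'}", ""]
    lines += [f"Prompt: {query}", "Answer:"]
    return "\n".join(lines)

def detect_feature(feature_desc: str, examples: List[Example], query: str,
                   call_llm: Callable[[str], str]) -> bool:
    """Run one detector; call_llm is any text-in/text-out LLM client (hypothetical)."""
    reply = call_llm(build_detection_prompt(feature_desc, examples, query))
    return reply.strip().upper().startswith("YES")

def ensemble_detect(feature_desc: str, examples: List[Example], query: str,
                    clients: List[Callable[[str], str]]) -> bool:
    """Majority vote over several LLM backends, one way to derive an ensemble model."""
    votes = [detect_feature(feature_desc, examples, query, c) for c in clients]
    return sum(votes) > len(votes) / 2
```

In this reading of the abstract, each guideline feature gets its own detector, so assessing a learner's prompt amounts to running one such call per feature and comparing the resulting feature profile against the guidelines.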
ISSN: 1176-3647, 1436-4522