Use me wisely: AI-driven assessment for LLM prompting skills development
Format: Article
Language: English
Published: International Forum of Educational Technology & Society, 2025-07-01
Series: Educational Technology & Society
Online Access: https://www.j-ets.net/collection/published-issues/28_3#h.8qxfv1d3o98l
Summary: Prompting with large language model (LLM) powered chatbots, such as ChatGPT, is adopted in a variety of tasks and processes across different domains. Given the intrinsic complexity of LLMs, effective prompting is not as straightforward as anticipated, which highlights the need for novel educational and support methods that are both widely accessible and seamlessly integrated into task workflows. However, LLM prompting depends strongly on the specific task and domain, reducing the usefulness of generic methods. We investigate whether LLM-based methods can support learning assessments using ad-hoc guidelines and an extremely limited number of annotated prompt samples. In our framework, guidelines are transformed into features to be detected in learners' prompts. The descriptions of these features, together with annotated sample prompts, are used to create few-shot learning detectors. We compare various configurations of these few-shot detectors, testing three state-of-the-art LLMs and derived ensemble models. Our experiments are performed using cross-validation on the original sample prompts and a specifically collected test set of prompts from task-naive learners. We find that the choice of LLM has a strong impact on detection across our feature list. One of the most recent models, GPT-4, shows promising performance on most of the features. However, some closely related models (GPT-3, GPT-3.5 Turbo (Instruct)) show different behaviors when classifying features. We highlight the need for further research in light of the possible impact of design choices on the selection of features and detection prompts. Our findings are relevant for researchers and practitioners in generative AI literacy, as well as researchers in computer-supported learning assessment.
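For readers who want a concrete picture of the approach described in the abstract, the following is a minimal illustrative sketch, not the authors' implementation, of how a single few-shot feature detector might be assembled on top of the OpenAI chat-completions API: a guideline-derived feature description and a few annotated sample prompts are placed in the context, and the model labels a new learner prompt. The feature text, the example prompts, and the `detect_feature` helper are hypothetical placeholders, not material from the paper.

```python
# Illustrative sketch only: one few-shot detector for one guideline-derived feature.
# Assumes the OpenAI Python client (>= 1.0) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical feature description derived from a prompting guideline.
FEATURE_DESCRIPTION = (
    "The prompt explicitly states the role the chatbot should take "
    "(e.g., 'Act as a statistics tutor')."
)

# A handful of annotated sample prompts serve as few-shot examples.
FEW_SHOT_EXAMPLES = [
    ("Act as a career coach and review my CV section by section.", "yes"),
    ("Summarize this article in three bullet points.", "no"),
]

def detect_feature(learner_prompt: str, model: str = "gpt-4") -> str:
    """Return 'yes' or 'no' depending on whether the feature is detected."""
    messages = [{
        "role": "system",
        "content": ("You classify learner prompts. Answer only 'yes' or 'no'.\n"
                    f"Feature to detect: {FEATURE_DESCRIPTION}"),
    }]
    # Few-shot examples are presented as prior user/assistant turns.
    for example_prompt, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": example_prompt})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": learner_prompt})

    response = client.chat.completions.create(
        model=model, messages=messages, temperature=0
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    print(detect_feature("Pretend you are a history teacher and quiz me on WW2."))
```

Under this reading, an ensemble of the kind mentioned in the abstract could amount to running several such detectors backed by different models and combining their labels, for example by majority vote; how the paper actually derives its ensembles is described in the full text linked above.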
ISSN: 1176-3647, 1436-4522