Rater Reliability and Rating Scale Utility for the AP Japanese Computer-Simulated Conversation Task: Evaluation Inference


Bibliographic Details
Main Author: Nana Suzumura-Smith
Format: Article
Language: English
Published: National Council of Less Commonly Taught Languages 2025-07-01
Series: Journal of the National Council of Less Commonly Taught Languages
Online Access: https://ncolctl.org/wp-content/uploads/2025/07/vol38-p4.pdf
Description
Summary: This study examined the validity of the scoring procedures for the AP Japanese conversation task using an argument-based approach, with a focus on rater reliability and rating scale functioning. Data were collected from 102 high school students through a test simulation, with three raters scoring the performances using a common 7-point scale. Test scores were analyzed across raters and speech acts using the Partial Credit Rasch model. Results provided support for rater reliability but only limited support for the intended functioning of the rating scale. To enhance task validity, three potential modifications were proposed: controlling speech act types and numbers, reducing the number of score categories, and modifying the scoring procedure. This study sheds light on the validity argument for the AP Japanese conversation task and addresses the scarcity of validity evidence for this exam. The findings underscore the importance of empirically confirming rating scale functioning in any assessment context.
ISSN: 1930-9031, 2689-2979