Comparative Judgment: Building a Shared Consensus Over Rater Variation in Assessing Second Language Writing Performance

Bibliographic Details
Main Author: Qian Wu
Format: Article
Language: English
Published: SAGE Publishing, 2025-06-01
Series: SAGE Open
Online Access: https://doi.org/10.1177/21582440251346346
Description
Summary: Rater variation has been a persistent concern in rater-mediated writing assessment. Instead of treating rater variation as an undesired source of measurement error, the method of comparative judgment (CJ) uses pairwise comparisons to elicit relative judgments from raters and statistical estimation to construct a measurement scale that ranks the assessed items, offering a viable approach that accommodates rater-associated heterogeneity in judgment making while still producing reliable and valid outcomes. The current study systematically examined the utility and quality of CJ as an assessment tool in the context of second language writing. A group of 16 raters (8 experienced and 8 novice) performed the CJ assessment on 94 English writing texts in the absence of rubric criteria. Despite their varying expertise and rating experience, the raters delivered judgments consistent with the shared consensus, yielding a CJ rank order of the writing texts with moderate reliability. Analyses of the raters' justifications for their judgments showed that raters varied substantially in their evaluation criteria, yet the collective expertise derived from the iterative CJ process aligned closely with the established scoring rubric. Additionally, cases in which raters and texts deviated significantly from the consensus of judgments were explored, and practical implications are discussed. The results provide empirical evidence for the construct validity of CJ and add a novel perspective to the discussion of rater variation in second language writing assessment.
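
The abstract describes the CJ machinery only in outline: pairwise comparisons are pooled, and a statistical model converts the win/loss record into a measurement scale. The estimator is not specified here, but CJ applications most commonly fit a Bradley-Terry model to the comparison outcomes. The Python sketch below illustrates that scaling step under the Bradley-Terry assumption, using the standard MM (Zermelo) iteration; the function name, toy data, and parameter values are illustrative and not taken from the study.

    import numpy as np

    def bradley_terry(wins, n_iter=500, tol=1e-8):
        # Estimate Bradley-Terry log-strengths from a pairwise win matrix.
        # wins[i, j] = number of comparisons in which text i was preferred
        # over text j. Assumes every text wins at least one comparison.
        n = wins.shape[0]
        p = np.ones(n)                       # initial strengths
        n_ij = wins + wins.T                 # comparisons per pair of texts
        w = wins.sum(axis=1)                 # total wins per text
        for _ in range(n_iter):
            denom = n_ij / (p[:, None] + p[None, :])
            np.fill_diagonal(denom, 0.0)
            p_new = w / denom.sum(axis=1)    # MM (Zermelo) update
            p_new /= p_new.sum()             # fix the arbitrary scale
            if np.max(np.abs(p_new - p)) < tol:
                p = p_new
                break
            p = p_new
        return np.log(p)                     # scale values used for ranking

    # Toy usage: 4 texts of known quality, judged in 300 random pairings.
    rng = np.random.default_rng(0)
    quality = np.array([0.0, 0.5, 1.0, 2.0])
    wins = np.zeros((4, 4))
    for _ in range(300):
        i, j = rng.choice(4, size=2, replace=False)
        p_win = 1.0 / (1.0 + np.exp(quality[j] - quality[i]))
        if rng.random() < p_win:
            wins[i, j] += 1
        else:
            wins[j, i] += 1
    print(bradley_terry(wins))  # recovers the rank order of quality

In a design like the study's, the 94 texts would occupy the rows and columns of the win matrix and the 16 raters would contribute the comparisons; the resulting scale values give the CJ rank order whose reliability the paper reports as moderate.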
ISSN: 2158-2440