Data preparation in crowdsourcing for pedagogical purposes

Bibliographic Details
Main Authors:	Tanara Zingano Kuhn, Špela Arhar Holdt, Iztok Kosem, Carole Tiberius, Kristina Koppel, Rina Zviel-Girshin
Format:	Article
Language:	English
Published:	University of Ljubljana Press (Založba Univerze v Ljubljani) 2022-12-01
Series:	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects:	crowdsourcing; game with a purpose; example sentences; pedagogical corpus
Online Access:	https://journals.uni-lj.si/slovenscina2/article/view/11431

Description
Summary:	One way to stimulate the use of corpora in language education is by making pedagogically appropriate corpora, labeled with different types of problems (sensitive content, offensive language, structural problems). However, manually labeling corpora is extremely time-consuming and a better approach should be found. We thus propose a combination of two approaches to the creation of problem-labeled pedagogical corpora of Dutch, Estonian, Slovene and Brazilian Portuguese: the use of games with a purpose and of crowdsourcing for the task. We conducted initial experiments to establish the suitability of the crowdsourcing task, and used the lessons learned to design the Crowdsourcing for Language Learning (CrowLL) game in which players identify problematic sentences, classify them, and indicate problematic excerpts. The focus of this paper is on data preparation, given the crucial role that such a stage plays in any crowdsourcing project dealing with the creation of language learning resources. We present the methodology for data preparation, offering a detailed presentation of source corpora selection, pedagogically oriented GDEX configurations, and the creation of lemma lists, with a special focus on common and language-dependent decisions. Finally, we offer a discussion of the challenges that emerged and the solutions that have been implemented so far.
ISSN:	2335-2736


Similar Items