Adapting an English Corpus and a Question Answering System for Slovene

Developing effective question answering (QA) models for less-resourced languages like Slovene is challenging due to the lack of proper training data. Modern machine translation tools can address this issue, but this presents another challenge: the given answers must be found in their exact form with...

Bibliographic Details
Main Authors:	Uroš Šmajdek, Matjaž Zupanič, Maj Zirkelbach, Meta Jazbinšek
Format:	Article
Language:	English
Published:	University of Ljubljana Press (Založba Univerze v Ljubljani) 2023-09-01
Series:	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects:	question answering machine translation multilingual models
Online Access:	https://journals.uni-lj.si/slovenscina2/article/view/12064

_version_	1839629941627420672
author	Uroš Šmajdek Matjaž Zupanič Maj Zirkelbach Meta Jazbinšek
author_facet	Uroš Šmajdek Matjaž Zupanič Maj Zirkelbach Meta Jazbinšek
author_sort	Uroš Šmajdek
collection	DOAJ
description	Developing effective question answering (QA) models for less-resourced languages like Slovene is challenging due to the lack of proper training data. Modern machine translation tools can address this issue, but this presents another challenge: the given answers must be found in their exact form within the given context, since the model is trained to locate answers and not generate them. To address this challenge, we propose a method that embeds the answers within the context before translation and evaluate its effectiveness on the SQuAD 2.0 dataset translated using both eTranslation and Google Cloud translator. The results show that by employing our method we can reduce the rate at which answers were not found in the context from 56% to 7%. We then assess the translated datasets using various transformer-based QA models, examining the differences between the datasets and model configurations. To ensure that our models produce realistic results, we test them on a small subset of the original data that was human-translated. The results indicate that the primary advantages of using machine-translated data lie in refining smaller multilingual and monolingual models. For instance, the multilingual CroSloEngual BERT model fine-tuned and tested on Slovene data achieved nearly equivalent performance to one fine-tuned and tested on English data, with 70.2% and 73.3% of questions answered, respectively. Larger models, such as RemBERT, achieved comparable results, correctly answering questions in 77.9% of cases when fine-tuned and tested on Slovene compared to 81.1% on English; fine-tuning with English and testing with Slovene data also yielded similar performance.
format	Article
id	doaj-art-e82081c8122a4ecabc00a6fd7a167b2a
institution	Matheson Library
issn	2335-2736
language	English
publishDate	2023-09-01
publisher	University of Ljubljana Press (Založba Univerze v Ljubljani)
record_format	Article
series	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
spelling	doaj-art-e82081c8122a4ecabc00a6fd7a167b2a | 2025-07-14T12:57:03Z | eng | University of Ljubljana Press (Založba Univerze v Ljubljani) | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave | 2335-2736 | 2023-09-01 | 11 | 1 | 10.4312/slo2.0.2023.1.247-274 | 18451 | Adapting an English Corpus and a Question Answering System for Slovene | Uroš Šmajdek 0 | Matjaž Zupanič 1 | Maj Zirkelbach 2 | Meta Jazbinšek 3 | University of Ljubljana, Faculty of Computer and Information Science, Slovenia | University of Ljubljana, Faculty of Computer and Information Science, Slovenia | University of Ljubljana, Faculty of Computer and Information Science, Slovenia | University of Ljubljana, Faculty of Arts, Slovenia | https://journals.uni-lj.si/slovenscina2/article/view/12064 | question answering | machine translation | multilingual models
title	Adapting an English Corpus and a Question Answering System for Slovene
topic	question answering machine translation multilingual models
url	https://journals.uni-lj.si/slovenscina2/article/view/12064
work_keys_str_mv	AT urossmajdek adaptinganenglishcorpusandaquestionansweringsystemforslovene AT matjazzupanic adaptinganenglishcorpusandaquestionansweringsystemforslovene AT majzirkelbach adaptinganenglishcorpusandaquestionansweringsystemforslovene AT metajazbinsek adaptinganenglishcorpusandaquestionansweringsystemforslovene
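
The description field above says that answers are embedded within the context before translation so that they can still be located in their exact form afterwards. The record does not say how the embedding is done, so the following is only a minimal sketch under stated assumptions: the `[[` and `]]` markers, the function names `embed_answer` and `extract_answer`, and the hard-coded stand-in for a translated string are all hypothetical and not taken from the article. No translation service is called.

```python
import re

# Hypothetical markers; assumed (not verified) to pass through a translator unchanged.
OPEN, CLOSE = "[[", "]]"


def embed_answer(context: str, answer_start: int, answer_text: str) -> str:
    """Wrap the answer span in markers before the context is sent for translation."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer span does not match the context")
    return context[:answer_start] + OPEN + answer_text + CLOSE + context[end:]


def extract_answer(translated: str):
    """Recover (clean_context, answer_start, answer_text) from a marked translation.

    Returns None when the markers did not survive, i.e. the answer was not found.
    """
    pattern = re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE)
    match = re.search(pattern, translated, flags=re.DOTALL)
    if match is None:
        return None
    answer = match.group(1)
    # Removing the markers leaves the answer starting where the opening marker was.
    clean = translated[:match.start()] + answer + translated[match.end():]
    return clean, match.start(), answer


context = "The capital of Slovenia is Ljubljana."
marked = embed_answer(context, context.index("Ljubljana"), "Ljubljana")

# Stand-in for the machine-translated output of `marked` (written by hand here).
translated = "Glavno mesto Slovenije je [[Ljubljana]]."
clean, start, answer = extract_answer(translated)
```

With this layout the answer offset in the translated context comes from the marker position, so no string search over the translated text is needed. Examples whose markers are lost return `None` and can be counted as answers not found.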
