Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address

This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to G...

Bibliographic Details
Main Author:	Dolores Lemmenmeier-Batinić
Format:	Article
Language:	English
Published:	University of Ljubljana Press (Založba Univerze v Ljubljani) 2021-07-01
Series:	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects:	spoken Serbian; language biographical interviews; forms of address; data re-usability
Online Access:	https://journals.uni-lj.si/slovenscina2/article/view/9869

author	Dolores Lemmenmeier-Batinić
collection	DOAJ
description	This paper describes the procedure of building a TEI-XML corpus of spoken Serbian starting from raw transcripts. The corpus consists of semi-structured interviews, which were gathered with the aim of investigating forms of address in Serbian. The interviews were thoroughly transcribed according to GAT transcribing conventions. However, the transcription was carried out without tools that would control the validity of the GAT syntax or align the transcript with the audio recordings. In order to offer this resource to a broader audience, we resolved the inconsistencies in the original transcripts, normalised the semi-orthographic transcriptions and converted the corpus into a TEI format for transcriptions of speech. Further, we enriched the corpus by tagging and lemmatising the data. Lastly, we aligned the corpus turns to the corresponding audio segments by using a forced-alignment tool. In addition to presenting the main steps involved in converting the corpus to the XML format, this paper also discusses current challenges in the processing of spoken data, and the implications of data re-use regarding transcriptions of speech. This corpus can be used for studying Serbian from the perspective of interactional linguistics, for investigating morphosyntax, grammar, lexicon and phonetics of spoken Serbian, for studying disfluencies, as well as for testing models for automatic speech recognition and forced alignment. The corpus is freely available for research purposes.
format	Article
id	doaj-art-4cfc20c74f5e486d8d83d42a3cf7fe2d
institution	Matheson Library
issn	2335-2736
language	English
publishDate	2021-07-01
publisher	University of Ljubljana Press (Založba Univerze v Ljubljani)
record_format	Article
series	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
volume	9
issue	1
doi	10.4312/slo2.0.2021.1.123-144
author_affiliation	University of Zurich, Department of Slavonic Languages and Literatures, Switzerland
title	Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address
topic	spoken Serbian; language biographical interviews; forms of address; data re-usability
url	https://journals.uni-lj.si/slovenscina2/article/view/9869

Similar Items