An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation

Bibliographic Details
Main Authors: Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić, Leonid Stoimenov
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/13/7491
Description
Summary: Various areas of natural language processing (NLP) have greatly benefited from the development of large language models in recent years. This research addresses the challenge of developing efficient tokenizers for transformer-based domain-specific language models. Tokenization efficiency within transformer-based models is directly related to model efficiency, which motivated the research presented in this paper. Our goal was to demonstrate that the appropriate selection of data used for tokenizer training has a significant impact on tokenizer performance, and that efficient tokenizers and models can be developed even when language resources are limited. To this end, we present a domain-adapted large language model tokenizer developed for masked language modeling of the Serbian legal domain. We compare the tokenization performance of the domain-adapted tokenizer in version 2 of our SrBERTa language model against five other tokenizers belonging to state-of-the-art multilingual, Slavic, or Serbian-specific models: XLM-RoBERTa (base-sized), BERTić, Jerteh-81, SrBERTa v1, and NER4Legal_SRB. The comparison is performed on a test dataset of 275,660 samples of legal texts written in the Cyrillic alphabet, gathered from the Official Gazette of the Republic of Serbia. The dataset contains 197,134 distinct words, with an overall word count of 5,265,352. We show that our tokenizer, trained on a domain-adapted dataset, outperforms the other tokenizers by between 4.5% and 54.62% with respect to the number of tokens generated for the whole test dataset, and by between 6.39% and 56.8% in terms of tokenizer fertility.
ISSN: 2076-3417
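
As a practical illustration of the fertility metric reported in the summary, the following is a minimal sketch of how tokens-per-word fertility could be measured for a Hugging Face tokenizer. The paper's exact metric definition and evaluated checkpoints may differ; "xlm-roberta-base" is one of the compared models, while the sample sentence and the helper function name are illustrative placeholders only.

    # Minimal sketch: tokenizer fertility as average subword tokens per whitespace word.
    # Assumes the Hugging Face `transformers` library is installed; the sample sentence
    # below is only an illustrative placeholder, not data from the paper's test set.
    from transformers import AutoTokenizer

    def fertility(tokenizer, texts):
        """Return total subword tokens divided by total whitespace-delimited words."""
        total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
        total_words = sum(len(t.split()) for t in texts)
        return total_tokens / total_words

    if __name__ == "__main__":
        tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
        sample = ["Службени гласник Републике Србије објављује законе и друге прописе."]
        print(f"fertility: {fertility(tok, sample):.3f}")

A lower fertility value means fewer subword tokens per word, which is the sense in which the summary reports the domain-adapted tokenizer as outperforming the compared tokenizers.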