An Analysis of the Training Data Impact for Domain-Adapted Tokenizer Performances—The Case of Serbian Legal Domain Adaptation

Various areas of natural language processing (NLP) have greatly benefited from the development of large language models in recent years. This research addresses the challenge of developing efficient tokenizers for transformer-based domain-specific language models. Tokenization efficiency within tran...

Bibliographic Details
Main Authors:	Miloš Bogdanović, Milena Frtunić Gligorijević, Jelena Kocić, Leonid Stoimenov
Format:	Article
Language:	English
Published:	MDPI AG 2025-07-01
Series:	Applied Sciences
Subjects:	tokenization, large language model, natural language processing, BERT, domain adaptation, Serbian
Online Access:	https://www.mdpi.com/2076-3417/15/13/7491

Similar Items

EDUCATIONAL FUNCTIONS OF STUDENTS’ CREATIVE WORK IN PRIMARY SCHOOL SERBIAN LANGUAGE CLASSES: TEACHERS’ PERSPECTIVE
by: Iva Medojević, et al.
Published: (2025-06-01)

THE SOCIOLINGUISTIC SITUATION IN PRESENT-DAY MONTENEGRO – SERBIAN STUDIES, MONTENEGRIN STUDIES
by: D. Bojović
Published: (2018-12-01)

Artificial intelligence in foreign language teaching: Evaluating the reliability of large language models with a focus on Serbian as a foreign language
by: Danijela D. Vranješ, et al.
Published: (2025-07-01)

SUPPLEMENTARY TEACHING IN THE SERBIAN LANGUAGE ABROAD: PROBLEMS AND EXPERIENCES
by: Jelena M. Jovanović
Published: (2025-06-01)

Machine learning methods (tokenization) in marketing research
by: E. V. Ganebnykh, et al.
Published: (2024-06-01)

IT-SR-NER: SERBIAN-ITALIAN PARALLEL CORPUS FOR LEARNING SERBIAN AS A FOREIGN LANGUAGE
by: Olja Perišić, et al.
Published: (2025-06-01)

Legal regime of NFT (non-fungible token) in Russia: How to work in the absence of special legislative regulation?
by: Y. V. Brisov, et al.
Published: (2022-04-01)

Transnational business strategies of Serbian entrepreneurs in Vienna
by: Antonijević Dragana, et al.
Published: (2025-01-01)

PRECEDENT NAMES IN SERBIAN CULTURE AND THEIR REFLECTIONS ON THE COMMUNICATIVE FUNCTION OF LANGUAGE
by: Đina M. Vesić, et al.
Published: (2025-06-01)

Converting raw transcripts into an annotated and turn-aligned TEI-XML corpus: the example of the Corpus of Serbian Forms of Address
by: Dolores Lemmenmeier-Batinić
Published: (2021-07-01)

Reconstructing Domain-Specific Features for Unsupervised Domain-Adaptive Object Detection
by: Shuai Dong, et al.
Published: (2025-05-01)

Risks and Prospects of Creativity Tokenization
by: R. A. Budnik
Published: (2023-08-01)

Token Money or Cryptocurrency: Technological Content and Economic Essence
by: A. V. Varnavskiy
Published: (2018-11-01)

A Comprehensive Approach to Instruction Tuning for Qwen2.5: Data Selection, Domain Interaction, and Training Protocols
by: Xungang Gu, et al.
Published: (2025-07-01)

Comparative Analysis of BERT and GPT for Classifying Crisis News with Sudan Conflict as an Example
by: Yahya Masri, et al.
Published: (2025-07-01)

Domain Knowledge-Enhanced Process Mining for Anomaly Detection in Commercial Bank Business Processes
by: Yanying Li, et al.
Published: (2025-07-01)

ABERT: Adapting BERT model for efficient detection of human and AI-generated fake news
by: Jawaher Alghamdi, et al.
Published: (2025-12-01)

Gender perspective on teacher-pupil classroom interaction: Feedback and evaluation
by: Margareta Basaragin, et al.
Published: (2019-12-01)

Three decades of Serbian discourse studies (1993-2023)
by: Slijepčević-Bjelivuk Svetlana M., et al.
Published: (2025-01-01)

Multi-Faceted Adaptive Token Pruning for Efficient Remote Sensing Image Segmentation
by: Chuge Zhang, et al.
Published: (2025-07-01)

Tokenization of creativity: user motivation, consensual value and Chinese copyright law
by: R. A. Budnik
Published: (2023-06-01)

Token Economy in Improving Discipline of Al-Quran Education Park Students
by: Abd. Hamid Cholili, et al.
Published: (2025-02-01)

Token ring networks /
by: Bird, David
Published: (1992)

BanglaHealth: A Bengali paraphrase dataset on health domain
by: Faisal Ibn Aziz, et al.
Published: (2025-08-01)

Achieving Efficient Prompt Engineering in Large Language Models Using a Hybrid and Multi-Objective Optimization Framework
by: Narayanaswamy Sridevi Kottapalli, et al.
Published: (2025-06-01)

Utility of Domain Adaptation for Biomass Yield Forecasting
by: Jonathan M. Vance, et al.
Published: (2025-07-01)

Legal status of non-fungible tokens (NFT): current state and prospects of legal regulation
by: V. O. Makarov
Published: (2023-06-01)

Identification of DNA N6-methyladenine modifications in the rice genome with a fine-tuned large language model
by: Yichi Zhang, et al.
Published: (2025-06-01)

Generating, retrieving persona and generating responses for long-term open-domain dialogue
by: Dohyun Cha, et al.
Published: (2025-07-01)

Method for Creating Domain-Specific Dataset Ontologies from Text in Uncontrolled English
by: Minab Shokoufeh Salem, et al.
Published: (2025-01-01)

Non-fungible tokens (NFT) and intellectual property: The triumph of the proprietary approach?
by: A. A. Dolganin
Published: (2021-11-01)

SoK: On the security of non-fungible tokens
by: Kai Ma, et al.
Published: (2025-06-01)

Towards bridging the synthetic-to-real gap in quantitative photoacoustic tomography via unsupervised domain adaptation
by: Zeqi Wang, et al.
Published: (2025-10-01)

Prediction of Alzheimer’s Disease Based on Multi-Modal Domain Adaptation
by: Binbin Fu, et al.
Published: (2025-06-01)

A survey of unsupervised domain adaptive semantic segmentation algorithms based on deep learning
by: Ying Junjie, et al.
Published: (2024-01-01)

Domain-Specific Languages for Algorithmic Graph Processing: A Systematic Literature Review
by: Houda Boukham, et al.
Published: (2025-07-01)

The Art of Tokenization: Blockchain Affordances and the Invention of Future Milieus
by: Laura Lotti
Published: (2019-08-01)

Large Language Models’ Trustworthiness in the Light of the EU AI Act—A Systematic Mapping Study
by: Md Masum Billah, et al.
Published: (2025-07-01)

The Cognitive Emotion Regulation Questionnaire (CERQ): The Evaluation of Structural and Convergent Validity on a Serbian Sample
by: Ilić Mihajlo, et al.
Published: (2025-01-01)

Comparison of pre-trained models for domain-specific entity extraction from student report documents
by: Antonina V. Melnikova, et al.
Published: (2025-03-01)