An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches
Mendeley Data

Bibliographic Details
Main Authors:	Nilufar Abdurakhmonova, Raima Shirinova, Rano Sayfullayeva, Davlatyor Mengliev, Bahodir Ibragimov, Manzura Ernazarova
Format:	Article
Language:	English
Published:	Elsevier 2025-08-01
Series:	Data in Brief
Subjects:	Morphological analysis; Low-resource languages; Uzbek language; Language corpus; Linguistic research
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340925004329

Description
Summary:	This research paper presents a morphologically annotated dataset for the Uzbek language, specifically designed for morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each annotated with root, affix, and part-of-speech information. Two morphological analysis approaches were implemented and compared: a user-defined rule-based stemming algorithm and a conditional random fields (CRF)-based machine learning model. Additionally, comprehensive genre testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a base resource for further research in the field of natural language processing in Uzbek, including applications in text generation, semantic analysis, and grammar correction.
ISSN:	2352-3409

Similar Items