An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches
Mendeley Data

Bibliographic Details
Main Authors: Nilufar Abdurakhmonova, Raima Shirinova, Rano Sayfullayeva, Davlatyor Mengliev, Bahodir Ibragimov, Manzura Ernazarova
Format: Article
Language: English
Published: Elsevier 2025-08-01
Series: Data in Brief
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340925004329
Description
Summary: This research paper presents a morphologically annotated dataset for the Uzbek language, specifically designed for morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each annotated with root, affix, and part-of-speech information. Two morphological analysis approaches were implemented and compared: a user-defined rule-based stemming algorithm and a conditional random fields (CRF)-based machine learning model. Additionally, comprehensive genre testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a base resource for further research in the field of natural language processing in Uzbek, including applications in text generation, semantic analysis, and grammar correction.
ISSN: 2352-3409