An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches
Mendeley Data
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-08-01 |
Series: | Data in Brief |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340925004329 |
Summary: | This research paper presents a morphologically annotated dataset for the Uzbek language, specifically designed for morphological analysis algorithms. The dataset contains 3022 manually annotated word forms, each annotated with root, affix, and part-of-speech information. Two morphological analysis approaches were implemented and compared: a user-defined rule-based stemming algorithm and a conditional random fields (CRF)-based machine learning model. Additionally, comprehensive genre testing was conducted on legal, political-economic, and educational texts to assess generalizability. The dataset is publicly available in Excel format and is intended as a base resource for further research in the field of natural language processing in Uzbek, including applications in text generation, semantic analysis, and grammar correction. |
---|---|
ISSN: | 2352-3409 |