CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus

Bibliographic Details
Main Authors: Peng Ye, Yujin Jiang, Yadi Wang
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Information
Online Access: https://www.mdpi.com/2078-2489/16/7/610
Description
Summary: Toponyms are fundamental geographical resources that, unlike general nouns, carry spatial attributes. Natural language offers rich toponymic data beyond what traditional surveying provides, but its qualitative ambiguity and inherent uncertainty make systematic extraction difficult. Traditional toponym recognition methods based on part-of-speech tagging consider only the surface-level features of words and fail to handle complex cases such as alias nesting, metonymy ambiguity, and mixed punctuation. As a result, the semantic integrity of toponyms is lost and geographic entities are recognized inaccurately. This study proposes a set of Chinese toponym annotation specifications that integrate spatial semantics. Using XML markup, the specifications combine the spatial location characteristics of toponyms with their linguistic features and define fine-grained annotation rules to address the limitations of traditional methods in semantic integrity and geographic entity recognition. On this basis, multi-source corpora from two sources, the Encyclopedia of China: Chinese Geography and People’s Daily, were integrated to construct CHTopo, a large-scale Chinese toponym annotation corpus covering five major categories of toponyms. The annotated corpus was evaluated on a toponym recognition task, and the study examines, from the perspectives of applicability and practicality, how to construct a large-scale, diversified, high-coverage Chinese toponym annotated corpus. CHTopo provides foundational support for geographic information extraction, spatial knowledge graphs, and geoparsing research, bridging linguistic and geospatial intelligence.
ISSN: 2078-2489
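
Illustrative note: the summary describes XML-based toponym annotation but does not give the tag set. The sketch below shows one way inline XML annotations of this general kind could be turned into character-offset spans for a recognition task. The element name `toponym`, the attribute `type`, the category labels, and the sample sentence are all hypothetical and are not taken from the CHTopo specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence. The tag name, attribute name, and
# category labels are assumptions, not the actual CHTopo scheme.
SAMPLE = (
    '<s><toponym type="admin">南京市</toponym>位于'
    '<toponym type="water">长江</toponym>下游。</s>'
)


def extract_toponyms(xml_sentence):
    """Return (plain_text, spans); each span is (start, end, surface, type)."""
    root = ET.fromstring(xml_sentence)
    parts = []   # pieces of the plain text, in document order
    spans = []   # character-offset spans of annotated toponyms
    offset = 0   # running length of the plain text built so far

    def add(text):
        nonlocal offset
        if text:
            parts.append(text)
            offset += len(text)

    add(root.text)  # text before the first child element, if any
    for child in root:
        start = offset
        # itertext() flattens any nested elements inside the annotation
        surface = "".join(child.itertext())
        add(surface)
        if child.tag == "toponym":
            spans.append((start, offset, surface, child.get("type")))
        add(child.tail)  # text between this element and the next one
    return "".join(parts), spans


if __name__ == "__main__":
    text, spans = extract_toponyms(SAMPLE)
    print(text)
    for span in spans:
        print(span)
```

Storing offsets against the plain text, rather than keeping the markup inline, lets the same annotations be converted to token-level labels for whichever tokenizer a recognition model uses.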