Can large language models generate geospatial code?
Main Authors:
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-08-01
Series: Geo-spatial Information Science
Subjects:
Online Access: https://www.tandfonline.com/doi/10.1080/10095020.2025.2535523
Summary: As large language models increasingly exhibit hallucinations such as refusal to respond, generation of non-executable code, and poor readability in geospatial code generation tasks, establishing a systematic and quantifiable evaluation framework has become essential for advancing their application in GIS. This paper introduces GeoCode-Eval, the first comprehensive evaluation framework for LLMs in geospatial code generation. Grounded in three dimensions (cognition and memory, understanding and interpretation, and innovation and creation), the framework addresses eight competency levels, including platform and tool cognition, functional knowledge, dataset recognition, information extraction, and various code-related tasks. To support this, the GeoCode-Bench benchmark was developed, consisting of 5,000 multiple-choice questions, 1,500 true/false questions, 1,500 fill-in-the-blank questions, and 1,000 coding tasks. Using six indicators, namely executability, accuracy, readability, location correctness, content correctness, and summary completeness, the study evaluates twelve representative models spanning four categories, including DeepSeek-Coder-V2 and GeoCode-GPT (7B). A combination of analytical methods, including entropy weighting, the coefficient of variation, skewness, and kurtosis, is applied to examine model capability distribution, indicator distribution, code type characteristics, and error type patterns. Results show consistent performance in tool cognition and code summarization, while significant performance gaps persist in code generation, completion, and correction. Common errors include data type and syntax issues. This study provides a quantifiable foundation for evaluating the capabilities and guiding the future optimization of LLMs in geospatial code generation, thereby extending the application boundaries of LLMs in GIS and offering valuable insights into the development of evaluation methodologies for LLM applications in other vertical domains.
ISSN: 1009-5020, 1993-5153
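
The abstract names entropy weighting among the analytical methods but does not give the exact formulation used in the paper. As a rough illustration only, the sketch below applies the standard entropy-weight method to a hypothetical model-by-indicator score matrix; the indicator names are taken from the abstract, and all numeric values are invented.

```python
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """Standard entropy-weight method.

    scores: (m models x n indicators) matrix of non-negative values.
    Returns one weight per indicator; indicators with more dispersion
    across models (lower entropy) receive higher weight.
    """
    m, _ = scores.shape
    # Normalise each indicator column into a distribution over models.
    col_sums = scores.sum(axis=0, keepdims=True)
    p = scores / np.where(col_sums == 0, 1, col_sums)
    # Information entropy per indicator, treating 0 * log(0) as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        logp = np.where(p > 0, np.log(p), 0.0)
    e = -(p * logp).sum(axis=0) / np.log(m)
    d = 1.0 - e
    return d / d.sum()

# Hypothetical scores for three models on the six indicators named in the
# abstract: executability, accuracy, readability, location correctness,
# content correctness, summary completeness (illustrative values only).
scores = np.array([
    [0.90, 0.75, 0.80, 0.60, 0.70, 0.85],
    [0.60, 0.55, 0.78, 0.40, 0.50, 0.80],
    [0.30, 0.35, 0.76, 0.20, 0.30, 0.75],
])
w = entropy_weights(scores)
print("indicator weights:", np.round(w, 3))
print("weighted model scores:", np.round(scores @ w, 3))
```

In this scheme an indicator on which the evaluated models barely differ (readability in the invented matrix) receives a near-zero weight, while indicators with large spread dominate the composite score, which is the usual rationale for entropy weighting in multi-indicator evaluations.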