Can large language models generate geospatial code?
Main Authors:
Format: Article
Language: English
Published: Taylor & Francis Group, 2025-08-01
Series: Geo-spatial Information Science
Subjects:
Online Access: https://www.tandfonline.com/doi/10.1080/10095020.2025.2535523
Summary: As large language models increasingly exhibit hallucinations such as refusal to respond, generation of non-executable code, and poor readability in geospatial code generation tasks, establishing a systematic and quantifiable evaluation framework has become essential for advancing their application in GIS. This paper introduces GeoCode-Eval, the first comprehensive evaluation framework for LLMs in geospatial code generation. Grounded in three dimensions (cognition and memory, understanding and interpretation, and innovation and creation), the framework addresses eight competency levels, including platform and tool cognition, functional knowledge, dataset recognition, information extraction, and various code-related tasks. To support this, the GeoCode-Bench benchmark was developed, consisting of 5,000 multiple-choice questions, 1,500 true/false questions, 1,500 fill-in-the-blank questions, and 1,000 coding tasks. Using six indicators, namely executability, accuracy, readability, location correctness, content correctness, and summary completeness, the study evaluates twelve representative models spanning four categories, including DeepSeek-Coder-V2 and GeoCode-GPT (7B). A combination of analytical methods, including entropy weighting, the coefficient of variation, skewness, and kurtosis, is applied to examine model capability distribution, indicator distribution, code type characteristics, and error type patterns. Results show consistent performance in tool cognition and code summarization, while significant performance gaps persist in code generation, completion, and correction. Common errors include data type and syntax issues. This study provides a quantifiable foundation for evaluating the capabilities and guiding the future optimization of LLMs in geospatial code generation, thereby extending the application boundaries of LLMs in GIS and offering valuable insights into the development of evaluation methodologies for LLM applications in other vertical domains.
ISSN: 1009-5020, 1993-5153
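
The abstract names entropy weighting among the analytical methods but does not give the exact formulation used in the paper. As a rough illustration only, the sketch below applies the standard entropy-weight method to a hypothetical model-by-indicator score matrix; the indicator names are taken from the abstract, and all numeric values are invented.

```python
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """Standard entropy-weight method.

    scores: (m models x n indicators) matrix of non-negative values.
    Returns one weight per indicator; indicators with more dispersion
    across models (lower entropy) receive higher weight.
    """
    m, _ = scores.shape
    # Normalise each indicator column into a distribution over models.
    col_sums = scores.sum(axis=0, keepdims=True)
    p = scores / np.where(col_sums == 0, 1, col_sums)
    # Information entropy per indicator, treating 0 * log(0) as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        logp = np.where(p > 0, np.log(p), 0.0)
    e = -(p * logp).sum(axis=0) / np.log(m)
    d = 1.0 - e
    return d / d.sum()

# Hypothetical scores for three models on the six indicators named in the
# abstract: executability, accuracy, readability, location correctness,
# content correctness, summary completeness (illustrative values only).
scores = np.array([
    [0.90, 0.75, 0.80, 0.60, 0.70, 0.85],
    [0.60, 0.55, 0.78, 0.40, 0.50, 0.80],
    [0.30, 0.35, 0.76, 0.20, 0.30, 0.75],
])
w = entropy_weights(scores)
print("indicator weights:", np.round(w, 3))
print("weighted model scores:", np.round(scores @ w, 3))
```

In this scheme an indicator on which the evaluated models barely differ (readability in the invented matrix) receives a near-zero weight, while indicators with large spread dominate the composite score, which is the usual rationale for entropy weighting in multi-indicator evaluations.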