Harmonizing organ-at-risk structure names using open-source large language models

Background and purpose: Standardized radiotherapy structure nomenclature is crucial for automation, inter-institutional collaborations, and large-scale deep learning studies in radiation oncology. Despite the availability of nomenclature guidelines (AAPM-TG-263), their implementation is lacking and...

Bibliographic Details
Main Authors:	Adrian Thummerer, Matteo Maspero, Erik van der Bijl, Stefanie Corradini, Claus Belka, Guillaume Landry, Christopher Kurz
Format:	Article
Language:	English
Published:	Elsevier 2025-07-01
Series:	Physics and Imaging in Radiation Oncology
Subjects:	Large language models; LLMs; Structure renaming; AAPM TG-263
Online Access:	http://www.sciencedirect.com/science/article/pii/S2405631625001186

author	Adrian Thummerer; Matteo Maspero; Erik van der Bijl; Stefanie Corradini; Claus Belka; Guillaume Landry; Christopher Kurz
author_sort	Adrian Thummerer
collection	DOAJ
description	Background and purpose: Standardized radiotherapy structure nomenclature is crucial for automation, inter-institutional collaborations, and large-scale deep learning studies in radiation oncology. Despite the availability of nomenclature guidelines (AAPM-TG-263), their implementation is lacking and still faces challenges. This study evaluated open-source large language models (LLMs) for automated organ-at-risk (OAR) renaming on a multi-institutional and multilingual dataset. Materials and methods: Four open-source LLMs (Llama 3.3, Llama 3.3 R1, DeepSeek V3, DeepSeek R1) were evaluated using a dataset of 34,177 OAR structures from 1684 patients collected at three university medical centers with manual TG-263 ground-truth labels. LLM renaming was performed using a few-shot prompting technique, including detailed instructions and generic examples. Performance was assessed by calculating renaming accuracy on the entire dataset and a unique dataset (duplicates removed). In addition, we performed a failure analysis, prompt-based confidence correlation, and Monte Carlo sampling-based uncertainty estimation. Results: High renaming accuracy was achieved, with the reasoning-enhanced DeepSeek R1 model performing best (98.6 % unique accuracy, 99.9 % overall accuracy). Overall, reasoning models outperformed their non-reasoning counterparts. Monte Carlo sampling showed a stronger correlation with prediction errors (correlation coefficient of 0.70 for DeepSeek R1) and better error detection (Sensitivity 0.73, Specificity 1.0 for DeepSeek R1) compared to prompt-based confidence estimation (correlation coefficient < 0.42). Conclusions: Open-source LLMs, particularly those with reasoning capabilities, can accurately harmonize OAR nomenclature according to TG-263 across diverse multilingual and multi-institutional datasets. They can also facilitate TG-263 nomenclature adoption and the creation of large, standardized datasets for research and AI development.
format	Article
id	doaj-art-a04f92caaa3b40be9e3b83af00c2932b
institution	Matheson Library
issn	2405-6316
language	English
publishDate	2025-07-01
publisher	Elsevier
record_format	Article
series	Physics and Imaging in Radiation Oncology
spelling	doaj-art-a04f92caaa3b40be9e3b83af00c2932b; 2025-08-01T04:44:47Z; eng; Elsevier; Physics and Imaging in Radiation Oncology; 2405-6316; 2025-07-01; 35; 100813; Harmonizing organ-at-risk structure names using open-source large language models
	Adrian Thummerer (0): Department of Radiation Oncology, LMU University Hospital, LMU Munich, Munich, Germany; Corresponding author at: Department of Radiation Oncology, LMU University Hospital, LMU Munich, 81377 Munich, Germany.
	Matteo Maspero (1): Department of Radiation Oncology, University Medical Center Utrecht, Utrecht, the Netherlands
	Erik van der Bijl (2): Department of Radiation Oncology, Radboud University Medical Center, Nijmegen, the Netherlands
	Stefanie Corradini (3): Department of Radiation Oncology, LMU University Hospital, LMU Munich, Munich, Germany
	Claus Belka (4): Department of Radiation Oncology, LMU University Hospital, LMU Munich, Munich, Germany; German Cancer Consortium (DKTK), partner site Munich, a partnership between DKFZ and LMU University Hospital Munich Germany, Munich, Germany; Bavarian Cancer Research Center (BZKF), Munich, Germany
	Guillaume Landry (5): Department of Radiation Oncology, LMU University Hospital, LMU Munich, Munich, Germany
	Christopher Kurz (6): Department of Radiation Oncology, LMU University Hospital, LMU Munich, Munich, Germany
	http://www.sciencedirect.com/science/article/pii/S2405631625001186
	Large language models; LLMs; Structure renaming; AAPM TG-263
title	Harmonizing organ-at-risk structure names using open-source large language models
topic	Large language models; LLMs; Structure renaming; AAPM TG-263
url	http://www.sciencedirect.com/science/article/pii/S2405631625001186
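The evaluation steps summarized in the description field (renaming accuracy on the entire dataset and on a duplicate-free dataset, Monte Carlo sampling-based uncertainty, and error detection reported as sensitivity and specificity) can be illustrated with a minimal Python sketch. This is not the authors' code, and the record does not specify their exact definitions. The sketch assumes that "unique" means one row per distinct (original name, ground-truth label) pair, that Monte Carlo uncertainty is one minus the share of repeated model samples agreeing with the most frequent answer, and that a prediction is flagged as a likely error when this uncertainty exceeds a chosen threshold. All function names are hypothetical.

```python
from collections import Counter


def renaming_accuracy(originals, predictions, truths):
    """Return (overall_accuracy, unique_accuracy).

    Assumption: the unique dataset keeps the first row for each distinct
    (original name, ground-truth label) pair.
    """
    if not (len(originals) == len(predictions) == len(truths)):
        raise ValueError("inputs must have equal length")
    if not originals:
        raise ValueError("inputs must not be empty")

    # Overall accuracy: every structure counts, including repeated names.
    overall = sum(p == t for p, t in zip(predictions, truths)) / len(truths)

    # Unique accuracy: each distinct (name, label) pair counts once.
    first_seen = {}
    for name, pred, truth in zip(originals, predictions, truths):
        first_seen.setdefault((name, truth), pred == truth)
    unique = sum(first_seen.values()) / len(first_seen)
    return overall, unique


def mc_uncertainty(samples):
    """Return (majority_name, uncertainty) for repeated samples of one input.

    Assumption: uncertainty = 1 - (share of samples equal to the majority name).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    name, count = Counter(samples).most_common(1)[0]
    return name, 1.0 - count / len(samples)


def sensitivity_specificity(is_error, flagged):
    """Score an error detector against the known errors.

    is_error[i] is True when the prediction was wrong.
    flagged[i] is True when the uncertainty exceeded the chosen threshold.
    """
    pairs = list(zip(is_error, flagged))
    tp = sum(e and f for e, f in pairs)          # errors that were flagged
    fn = sum(e and not f for e, f in pairs)      # errors that were missed
    tn = sum(not e and not f for e, f in pairs)  # correct, not flagged
    fp = sum(not e and f for e, f in pairs)      # correct, but flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Invented toy data for illustration only, not taken from the study.
    originals = ["Parotis li", "Parotis li", "Rueckenmark", "Hart"]
    predictions = ["Parotid_L", "Parotid_L", "SpinalCord", "Lung_L"]
    truths = ["Parotid_L", "Parotid_L", "SpinalCord", "Heart"]
    print(renaming_accuracy(originals, predictions, truths))
    print(mc_uncertainty(["Heart", "Heart", "Lung_L", "Heart"]))
```

Because frequent structure names are repeated many times, overall accuracy is typically higher than unique accuracy, which matches the pattern in the reported figures (99.9 % overall versus 98.6 % unique for DeepSeek R1).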

Similar Items