Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study

Bibliographic Details
Main Authors: Giacomo Rossettini, Silvia Bargeri, Chad Cook, Stefania Guida, Alvisa Palese, Lia Rodeghiero, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Silvia Gianola
Format: Article
Language: English
Published: Frontiers Media S.A., 2025-06-01
Series: Frontiers in Digital Health
ISSN: 2673-253X
DOI: 10.3389/fdgth.2025.1574287
Subjects: artificial intelligence; physiotherapy; machine learning; musculoskeletal; natural language processing; orthopaedics
Online Access: https://www.frontiersin.org/articles/10.3389/fdgth.2025.1574287/full
Description:
Introduction: Artificial Intelligence (AI) chatbots, which generate human-like responses from extensive training data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) remains unclear. We aimed to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain.

Methods: We performed a cross-sectional study assessing AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' Kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1.

Results: We found high variability in the text consistency of AI chatbot responses (median range 26%–68%). Intra-rater reliability ranged from "almost perfect" to "substantial," while inter-rater reliability varied from "almost perfect" to "moderate." Perplexity had the highest match rate at 67%, followed by Google Gemini at 63% and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate.

Conclusions: Despite variable internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since, depending on the specific chatbot, one- to two-thirds of the recommendations provided may be inappropriate or misleading.
Author Affiliations:
Giacomo Rossettini: School of Physiotherapy, University of Verona, Verona, Italy; Department of Physiotherapy, Faculty of Medicine, Health and Sports, Universidad Europea de Madrid, Madrid, Spain
Silvia Bargeri: Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
Chad Cook: Department of Orthopaedics, Duke University, Durham, NC, United States; Duke Clinical Research Institute, Duke University, Durham, NC, United States; Department of Population Health Sciences, Duke University, Durham, NC, United States
Stefania Guida: Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
Alvisa Palese: Department of Medical Sciences, University of Udine, Udine, Italy
Lia Rodeghiero: Department of Rehabilitation, Hospital of Merano (SABES-ASDAA), Teaching Hospital of Paracelsus Medical University (PMU), Merano-Meran, Italy
Paolo Pillastrini: Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy; Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
Andrea Turolla: Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy; Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
Greta Castellini: Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
Silvia Gianola: Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
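
The Methods above report agreement with Fleiss' Kappa and accuracy as a match rate against CPG recommendations; the study's own analyses were run in STATA/MP 16.1. The following Python sketch is only an editor-added illustration of how those two metrics are typically computed: the fleiss_kappa and match_rate helpers and the example counts are hypothetical and are not the authors' code or data.

```python
# Illustrative sketch only; the study used STATA/MP 16.1. This numpy version of
# Fleiss' kappa and the CPG match rate is an assumption for readers who want to
# reproduce the metrics, not the authors' implementation.
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for a subjects x categories count matrix.

    ratings[i, j] = number of raters assigning subject i to category j.
    Every subject must be rated by the same number of raters.
    """
    n_subjects, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Observed agreement per subject, then averaged across subjects.
    p_i = (np.sum(ratings ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = ratings.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)

def match_rate(chatbot_answers, cpg_recommendations) -> float:
    """Share of clinical questions where the chatbot's judgement matches the CPG."""
    matches = sum(a == c for a, c in zip(chatbot_answers, cpg_recommendations))
    return matches / len(cpg_recommendations)

# Hypothetical example: 9 clinical questions, 3 raters, 2 categories (match / no match).
counts = np.array([[3, 0], [2, 1], [3, 0], [0, 3], [1, 2],
                   [3, 0], [2, 1], [0, 3], [3, 0]])
print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")
print(f"Match rate: {match_rate(['A', 'B', 'A'], ['A', 'B', 'C']):.0%}")  # prints 67%
```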