Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study
Main Authors: | Giacomo Rossettini, Silvia Bargeri, Chad Cook, Stefania Guida, Alvisa Palese, Lia Rodeghiero, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Silvia Gianola |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2025-06-01 |
Series: | Frontiers in Digital Health |
Subjects: | artificial intelligence; physiotherapy; machine learning; musculoskeletal; natural language processing; orthopaedics |
Online Access: | https://www.frontiersin.org/articles/10.3389/fdgth.2025.1574287/full |
_version_ | 1839650553124093952 |
---|---|
author | Giacomo Rossettini; Giacomo Rossettini; Silvia Bargeri; Chad Cook; Chad Cook; Chad Cook; Stefania Guida; Alvisa Palese; Lia Rodeghiero; Paolo Pillastrini; Paolo Pillastrini; Andrea Turolla; Andrea Turolla; Greta Castellini; Silvia Gianola |
author_facet | Giacomo Rossettini; Giacomo Rossettini; Silvia Bargeri; Chad Cook; Chad Cook; Chad Cook; Stefania Guida; Alvisa Palese; Lia Rodeghiero; Paolo Pillastrini; Paolo Pillastrini; Andrea Turolla; Andrea Turolla; Greta Castellini; Silvia Gianola |
author_sort | Giacomo Rossettini |
collection | DOAJ |
description | Introduction: Artificial Intelligence (AI) chatbots, which generate human-like responses based on extensive data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) is still unclear. We aimed to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Methods: We performed a cross-sectional study assessing AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1. Results: We found high variability in the text consistency of AI chatbot responses (median range 26%–68%). Intra-rater reliability ranged from “almost perfect” to “substantial,” while inter-rater reliability varied from “almost perfect” to “moderate.” Perplexity had the highest match rate at 67%, followed by Google Gemini at 63% and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate. Conclusions: Despite the variability in internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since, depending on the chatbot, one-third to two-thirds of the recommendations provided may be inappropriate or misleading. (A minimal, hypothetical sketch of the Fleiss' kappa and match-rate computations follows this record.) |
format | Article |
id | doaj-art-81b091dc8fc94585a74d0c6c04440001 |
institution | Matheson Library |
issn | 2673-253X |
language | English |
publishDate | 2025-06-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Digital Health |
spelling | doaj-art-81b091dc8fc94585a74d0c6c04440001; 2025-06-27T05:31:40Z; eng; Frontiers Media S.A.; Frontiers in Digital Health; 2673-253X; 2025-06-01; 7; 10.3389/fdgth.2025.1574287; 1574287; Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study; Giacomo Rossettini (School of Physiotherapy, University of Verona, Verona, Italy; Department of Physiotherapy, Faculty of Medicine, Health and Sports, Universidad Europea de Madrid, Madrid, Spain); Silvia Bargeri (Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy); Chad Cook (Department of Orthopaedics, Duke University, Durham, NC, United States; Duke Clinical Research Institute, Duke University, Durham, NC, United States; Department of Population Health Sciences, Duke University, Durham, NC, United States); Stefania Guida (Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy); Alvisa Palese (Department of Medical Sciences, University of Udine, Udine, Italy); Lia Rodeghiero (Department of Rehabilitation, Hospital of Merano (SABES-ASDAA), Teaching Hospital of Paracelsus Medical University (PMU), Merano-Meran, Italy); Paolo Pillastrini (Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy; Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy); Andrea Turolla (Department of Biomedical and Neuromotor Sciences (DIBINEM), Alma Mater University of Bologna, Bologna, Italy; Unit of Occupational Medicine, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy); Greta Castellini (Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy); Silvia Gianola (Unit of Clinical Epidemiology, IRCCS Istituto Ortopedico Galeazzi, Milan, Italy); Introduction: Artificial Intelligence (AI) chatbots, which generate human-like responses based on extensive data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) is still unclear. We aimed to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Methods: We performed a cross-sectional study assessing AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1. Results: We found high variability in the text consistency of AI chatbot responses (median range 26%–68%). Intra-rater reliability ranged from “almost perfect” to “substantial,” while inter-rater reliability varied from “almost perfect” to “moderate.” Perplexity had the highest match rate at 67%, followed by Google Gemini at 63% and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate. Conclusions: Despite the variability in internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since, depending on the chatbot, one-third to two-thirds of the recommendations provided may be inappropriate or misleading. https://www.frontiersin.org/articles/10.3389/fdgth.2025.1574287/full; artificial intelligence; physiotherapy; machine learning; musculoskeletal; natural language processing; orthopaedics |
spellingShingle | Giacomo Rossettini; Giacomo Rossettini; Silvia Bargeri; Chad Cook; Chad Cook; Chad Cook; Stefania Guida; Alvisa Palese; Lia Rodeghiero; Paolo Pillastrini; Paolo Pillastrini; Andrea Turolla; Andrea Turolla; Greta Castellini; Silvia Gianola; Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study; Frontiers in Digital Health; artificial intelligence; physiotherapy; machine learning; musculoskeletal; natural language processing; orthopaedics |
title | Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study |
title_full | Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study |
title_fullStr | Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study |
title_full_unstemmed | Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study |
title_short | Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study |
title_sort | accuracy of chatgpt 3 5 chatgpt 4o copilot gemini claude and perplexity in advising on lumbosacral radicular pain against clinical practice guidelines cross sectional study |
topic | artificial intelligence; physiotherapy; machine learning; musculoskeletal; natural language processing; orthopaedics |
url | https://www.frontiersin.org/articles/10.3389/fdgth.2025.1574287/full |
work_keys_str_mv | AT giacomorossettini accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT giacomorossettini accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT silviabargeri accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT chadcook accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT chadcook accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT chadcook accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT stefaniaguida accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT alvisapalese accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT liarodeghiero accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT paolopillastrini accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT paolopillastrini accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT andreaturolla accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT andreaturolla accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT gretacastellini accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy; AT silviagianola accuracyofchatgpt35chatgpt4ocopilotgeminiclaudeandperplexityinadvisingonlumbosacralradicularpainagainstclinicalpracticeguidelinescrosssectionalstudy |
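
The Methods summarized in this record hinge on two computations: Fleiss' kappa for intra- and inter-rater reliability, and a per-chatbot match rate against CPG recommendations. Below is a minimal, hypothetical Python sketch of both. The study itself ran its analyses in STATA/MP 16.1, and the rating matrix here is invented for illustration rather than taken from the study's data; the verbal labels (“moderate,” “substantial,” “almost perfect”) correspond to the widely used Landis and Koch benchmarks for kappa.

```python
# Minimal, illustrative sketch (not the authors' code): Fleiss' kappa for
# rater agreement, plus a CPG match rate. All data below are hypothetical.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa. counts[i, j] = number of raters who assigned
    subject i (here: a chatbot answer) to category j."""
    n = counts.sum(axis=1)[0]           # raters per subject (assumed constant)
    N = counts.shape[0]                 # number of subjects rated
    p_j = counts.sum(axis=0) / (N * n)  # marginal proportion per category
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical data: 3 raters judge 9 answers as "matches CPG" (column 0)
# or "does not match" (column 1).
ratings = np.array([[3, 0], [3, 0], [2, 1], [0, 3], [3, 0],
                    [1, 2], [3, 0], [0, 3], [3, 0]])
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.2f}")  # 0.67 -> "substantial"

# Match rate as reported per chatbot: matched answers / total questions.
matches, total = 6, 9
print(f"Match rate: {matches / total:.0%}")  # 67%
```

On this toy matrix the function returns kappa ≈ 0.67, i.e., “substantial” agreement on the Landis and Koch scale, and 6 of 9 matched answers reproduce the arithmetic behind a 67% match rate such as the one reported for Perplexity.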