Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates
This study compared the performance of several Large Language Models (LLMs) in answering multiple-choice questions from the last six editions (2017 to 2024) of the Revalida exam. The evaluation focused on models capable of processing content in Brazilian Portuguese (PT-BR): the open-source models LLaMA 3.1 (8B parameters) and Qwen 2.5 (7B parameters), their reasoning-oriented distilled variants based on the DeepSeek-R1 architecture, and open-access commercial models such as GPT-3.5, GPT-4o, and Gemini. When accuracy was measured against the official answer keys, GPT-4o emerged as the top-performing model, achieving an average accuracy of 63.85%. Next, GPT-4o was prompted to justify its answers to the 2024 exam, and its explanations were independently reviewed by three licensed physicians; the evaluators reported full agreement with the clinical reasoning presented, indicating the model's ability to produce coherent and medically relevant justifications. Lastly, justifications generated by GPT-4o for correctly answered questions from previous exams (2017–2023) were compiled into a knowledge base, which was used both to enhance GPT-4o through retrieval-augmented generation (RAG) and to fine-tune LLaMA 3.1, yielding measurable performance improvements on the 2024 exam. Despite this promising performance, the models still exhibit response variability, hallucinations, and limited reliability in high-stakes contexts. Their outputs should therefore always be reviewed by qualified professionals, and human expertise remains essential in clinical decision-making and medical education, particularly for PT-BR content. Nonetheless, the gains observed from integrating prior exam content indicate that domain-specific adaptation strategies may help mitigate some of these limitations and improve model alignment.
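The article's evaluation code is not part of this record, but the scoring step the abstract describes, checking each model's letter choice against the official answer key and averaging over questions, is straightforward. Below is a minimal illustrative sketch; the data, question IDs, and the `score_exam` helper are assumptions, not the authors' implementation.

```python
# Minimal sketch of accuracy scoring against an official answer key.
# Question IDs, letter choices, and the helper name are illustrative only.

def score_exam(model_answers: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the fraction of questions whose letter choice matches the key."""
    graded = [
        model_answers.get(qid, "").strip().upper() == correct.strip().upper()
        for qid, correct in answer_key.items()
    ]
    return sum(graded) / len(graded) if graded else 0.0

# Made-up example: three questions from a hypothetical exam edition.
answer_key = {"Q1": "B", "Q2": "D", "Q3": "A"}
model_answers = {"Q1": "B", "Q2": "C", "Q3": "A"}
print(f"Accuracy: {score_exam(model_answers, answer_key):.2%}")  # Accuracy: 66.67%
```

An average over the six editions (as in the reported 63.85% for GPT-4o) would repeat this per exam and take the mean.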
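The abstract also mentions feeding a knowledge base of GPT-4o justifications back into the models through retrieval-augmented generation. A hedged sketch of how such a retrieval step might work is shown below, using cosine similarity over precomputed embeddings; the embedding source, prompt wording, and knowledge-base format are assumptions rather than the authors' actual pipeline.

```python
# Illustrative retrieval step for RAG: find the k past-exam justifications
# most similar to a new question and prepend them to the prompt.
import numpy as np

def top_k_context(question_vec: np.ndarray, kb_vecs: np.ndarray,
                  kb_texts: list[str], k: int = 3) -> list[str]:
    """Return the k knowledge-base justifications most similar to the question."""
    # Cosine similarity between the question embedding and each KB embedding.
    sims = kb_vecs @ question_vec / (
        np.linalg.norm(kb_vecs, axis=1) * np.linalg.norm(question_vec) + 1e-9
    )
    return [kb_texts[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble an exam prompt with retrieved justifications as context."""
    joined = "\n\n".join(context)
    return (f"Context from past Revalida justifications:\n{joined}\n\n"
            f"Question:\n{question}\nAnswer with a single letter and justify.")
```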
Main Authors: Renato Freitas Bessa, Adonias Caetano de Oliveira, Rafael Freitas Bessa, Daniel Lima Sousa, Rafaela Alves, Amanda Barbosa, Alinne Carneiro, Carla Soares, Ariel Soares Teles
Format: Article
Language: English
Published: MDPI AG, 2025-06-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app15137134
Subjects: health informatics; artificial intelligence; large language models (LLMs); performance analysis; medical licensing; medical education
Online Access: https://www.mdpi.com/2076-3417/15/13/7134
Collection: DOAJ
Institution: Matheson Library
Author Affiliations:
Renato Freitas Bessa, Adonias Caetano de Oliveira, Rafael Freitas Bessa, Daniel Lima Sousa, Ariel Soares Teles: Postgraduate Program in Biotechnology, Parnaíba Delta Federal University, Parnaíba 64202-020, Brazil
Rafaela Alves, Alinne Carneiro, Carla Soares: Faculty of Human, Exact and Health Sciences, Parnaíba Valley Higher Education Institute (IESVAP/AFYA), Parnaíba 64212-790, Brazil
Amanda Barbosa: Parnaíba Unit, Universidade Paulista (UNIP), Parnaíba 01311-000, Brazil