Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates

This study aimed to compare the performance of various Large Language Models (LLMs) in answering multiple-choice questions from the last six editions (2017 to 2024) of the Revalida exam. The evaluation focused on models capable of processing content in Brazilian Portuguese (PT-BR): the open-source models LLaMA 3.1 (8B parameters) and Qwen 2.5 (7B parameters), their reasoning-oriented distilled variants based on the DeepSeek-R1 architecture, and the open-access commercial models GPT-3.5, GPT-4o, and Gemini. After the models’ accuracy was evaluated against the official answer keys, GPT-4o emerged as the top performer, with an average accuracy of 63.85%. Next, GPT-4o was prompted to justify its answers to the 2024 exam, and its explanations were independently reviewed by three licensed physicians. The reviewers reported full agreement with the clinical reasoning presented, indicating that the model can produce coherent and medically relevant justifications. Lastly, the justifications GPT-4o generated for correctly answered questions from the 2017–2023 exams were compiled into a knowledge base, which was used both to enhance GPT-4o through retrieval-augmented generation and to fine-tune LLaMA 3.1, yielding measurable performance improvements on the 2024 exam. Despite this promising performance, the models still exhibit response variability, hallucinations, and limited reliability in high-stakes contexts. Their outputs should therefore always be reviewed by qualified professionals, and human expertise remains essential in PT-BR clinical decision-making and medical education scenarios. The gains observed from integrating prior exam content nonetheless suggest that domain-specific adaptation strategies can mitigate some of these limitations and improve model alignment.

Bibliographic Details
Main Authors: Renato Freitas Bessa, Adonias Caetano de Oliveira, Rafael Freitas Bessa, Daniel Lima Sousa, Rafaela Alves, Amanda Barbosa, Alinne Carneiro, Carla Soares, Ariel Soares Teles
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Subjects: health informatics; artificial intelligence; large language models (LLMs); performance analysis; medical licensing; medical education
Online Access: https://www.mdpi.com/2076-3417/15/13/7134
author Renato Freitas Bessa
Adonias Caetano de Oliveira
Rafael Freitas Bessa
Daniel Lima Sousa
Rafaela Alves
Amanda Barbosa
Alinne Carneiro
Carla Soares
Ariel Soares Teles
collection DOAJ
description This study aimed to compare the performance of various Large Language Models (LLMs) in answering multiple-choice questions from the last six editions (2017 to 2024) of the Revalida exam. The evaluation focused on models capable of processing content in Brazilian Portuguese (PT-BR): the open-source models LLaMA 3.1 (8B parameters) and Qwen 2.5 (7B parameters), their reasoning-oriented distilled variants based on the DeepSeek-R1 architecture, and the open-access commercial models GPT-3.5, GPT-4o, and Gemini. After the models’ accuracy was evaluated against the official answer keys, GPT-4o emerged as the top performer, with an average accuracy of 63.85%. Next, GPT-4o was prompted to justify its answers to the 2024 exam, and its explanations were independently reviewed by three licensed physicians. The reviewers reported full agreement with the clinical reasoning presented, indicating that the model can produce coherent and medically relevant justifications. Lastly, the justifications GPT-4o generated for correctly answered questions from the 2017–2023 exams were compiled into a knowledge base, which was used both to enhance GPT-4o through retrieval-augmented generation and to fine-tune LLaMA 3.1, yielding measurable performance improvements on the 2024 exam. Despite this promising performance, the models still exhibit response variability, hallucinations, and limited reliability in high-stakes contexts. Their outputs should therefore always be reviewed by qualified professionals, and human expertise remains essential in PT-BR clinical decision-making and medical education scenarios. The gains observed from integrating prior exam content nonetheless suggest that domain-specific adaptation strategies can mitigate some of these limitations and improve model alignment.
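The scoring step described in the abstract (checking each model's letter answer against the official answer key and averaging per edition) is straightforward to reproduce. The paper does not publish its evaluation harness, so the record layout and file name below are hypothetical; this is only a minimal sketch of that computation.

```python
import json
from collections import defaultdict

# Hypothetical input format, one record per answered question, e.g.:
# {"edition": 2024, "question_id": "Q37", "model": "GPT-4o",
#  "model_answer": "C", "official_key": "B"}

def accuracy_by_edition(records):
    """Per-edition accuracy for each model against the official key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["model"], r["edition"])
        total[key] += 1
        if r["model_answer"] == r["official_key"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}

if __name__ == "__main__":
    with open("revalida_answers.json", encoding="utf-8") as f:
        records = json.load(f)
    for (model, edition), acc in sorted(accuracy_by_edition(records).items()):
        print(f"{model}  {edition}: {acc:.2%}")
```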
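The retrieval-augmented generation step compiles past justifications into a knowledge base and prepends the most relevant ones to each new question. The abstract does not say which retriever the authors used, so the sketch below substitutes TF-IDF cosine similarity; the corpus entries and prompt wording are likewise placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Knowledge base: GPT-4o justifications for correctly answered
# 2017-2023 questions (placeholder strings stand in for the real corpus).
knowledge_base = [
    "Justification for a 2019 cardiology question ...",
    "Justification for a 2021 obstetrics question ...",
]

vectorizer = TfidfVectorizer()
kb_matrix = vectorizer.fit_transform(knowledge_base)

def build_rag_prompt(question: str, k: int = 3) -> str:
    """Retrieve the k most similar justifications and prepend them."""
    sims = cosine_similarity(vectorizer.transform([question]), kb_matrix)[0]
    top = sims.argsort()[::-1][:k]
    context = "\n\n".join(knowledge_base[i] for i in top)
    return (f"Context from past Revalida justifications:\n{context}\n\n"
            f"Question (answer with a single letter):\n{question}")

print(build_rag_prompt("A 34-year-old pregnant patient presents with ..."))
```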
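For the fine-tuning step, the abstract states only that LLaMA 3.1 was fine-tuned on the justification knowledge base. A common parameter-efficient recipe for a model of this size is LoRA via the peft library; the configuration below is an assumption for illustration, not the authors' recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Public Hugging Face repo name for the base model (access is gated).
base = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(base)  # needed to tokenize training pairs
model = AutoModelForCausalLM.from_pretrained(base)

# Assumed LoRA hyperparameters; the paper does not report its settings.
lora = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training on (question, justification, answer) pairs would then proceed
# with a standard causal-LM trainer; the data format is not specified
# in the abstract.
```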
format Article
id doaj-art-1e69fd897d1f47fab97f3d341962adc6
institution Matheson Library
issn 2076-3417
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
doi 10.3390/app15137134
volume 15
issue 13
article_number 7134
record_updated 2025-07-11T14:35:46Z
affiliation Renato Freitas Bessa, Adonias Caetano de Oliveira, Rafael Freitas Bessa, Daniel Lima Sousa, Ariel Soares Teles: Postgraduate Program in Biotechnology, Parnaíba Delta Federal University, Parnaíba 64202-020, Brazil
affiliation Rafaela Alves, Alinne Carneiro, Carla Soares: Faculty of Human, Exact and Health Sciences, Parnaíba Valley Higher Education Institute (IESVAP/AFYA), Parnaíba 64212-790, Brazil
affiliation Amanda Barbosa: Parnaíba Unit, Universidade Paulista (UNIP), Parnaíba 01311-000, Brazil
title Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates
topic health informatics
artificial intelligence
large language models (LLMs)
performance analysis
medical licensing
medical education
url https://www.mdpi.com/2076-3417/15/13/7134