Performance Comparison of Large Language Models on Brazil’s Medical Revalidation Exam for Foreign-Trained Graduates

Bibliographic Details
Main Authors: Renato Freitas Bessa, Adonias Caetano de Oliveira, Rafael Freitas Bessa, Daniel Lima Sousa, Rafaela Alves, Amanda Barbosa, Alinne Carneiro, Carla Soares, Ariel Soares Teles
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/13/7134
Description
Summary: This study aimed to compare the performance of various Large Language Models (LLMs) in answering multiple-choice questions from the last six editions (2017 to 2024) of the Revalida exam. The evaluation focused on models capable of processing content in Brazilian Portuguese (PT-BR), including open-source models, namely LLaMA 3.1 (8B parameters), Qwen 2.5 (7B parameters), and their reasoning-oriented distilled variants based on the DeepSeek-R1 architecture, as well as open-access commercial models such as GPT-3.5, GPT-4o, and Gemini. After the models’ accuracy was evaluated against the official answer keys, GPT-4o emerged as the top-performing model, achieving an average accuracy of 63.85%. Next, GPT-4o was prompted to justify its answers to the 2024 exam, and its explanations were independently reviewed by three licensed physicians. The evaluators reported full agreement with the clinical reasoning presented, indicating the model’s ability to produce coherent and medically relevant justifications. Lastly, justifications generated by GPT-4o for correctly answered questions from previous exams (2017–2023) were compiled into a knowledge base, which was then used to enhance GPT-4o through retrieval-augmented generation and to fine-tune LLaMA 3.1, yielding measurable performance improvements on the 2024 exam. Despite this promising performance, the models still exhibit response variability, hallucinations, and limited reliability in high-stakes contexts; their outputs should therefore always be reviewed by qualified professionals, and human expertise remains essential in clinical decision-making and medical education, particularly for PT-BR content. However, the observed gains from integrating prior exam content indicate that domain-specific adaptation strategies may help mitigate some of these limitations and improve model alignment.
ISSN:2076-3417
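
As a rough illustration of the accuracy evaluation described in the summary, the sketch below scores a set of model answers against an official answer key. The question numbers, answer letters, and data structures are hypothetical placeholders, not the study’s actual pipeline.

```python
# Minimal sketch of scoring multiple-choice answers against an answer key.
# All data here is hypothetical; the paper's prompts and parsing are not reproduced.

def score_exam(model_answers: dict[int, str], answer_key: dict[int, str]) -> float:
    """Return the fraction of key questions the model answered correctly."""
    correct = sum(
        1
        for q, expected in answer_key.items()
        if model_answers.get(q, "").strip().upper() == expected.strip().upper()
    )
    return correct / len(answer_key) if answer_key else 0.0

# Example: a hypothetical five-question exam.
answer_key = {1: "A", 2: "C", 3: "B", 4: "D", 5: "A"}
model_answers = {1: "A", 2: "C", 3: "D", 4: "D", 5: "B"}
print(f"Accuracy: {score_exam(model_answers, answer_key):.2%}")  # 60.00%
```

Under this reading of the abstract, the reported figures (e.g., GPT-4o’s 63.85%) would presumably be such per-edition scores averaged across the six Revalida exams.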