PathVLM-Eval: Evaluation of open vision language models in histopathology

Bibliographic Details
Main Authors: Nauman Ullah Gilal, Rachida Zegour, Khaled Al-Thelaya, Erdener Özer, Marco Agus, Jens Schneider, Sabri Boughorbel
Format: Article
Language: English
Published: Elsevier 2025-08-01
Series: Journal of Pathology Informatics
Online Access:http://www.sciencedirect.com/science/article/pii/S2153353925000409
Description
Summary: The emerging trend of vision language models (VLMs) has introduced a new paradigm in artificial intelligence (AI). However, their evaluation has predominantly focused on general-purpose datasets, providing a limited understanding of their effectiveness in specialized domains. Medical imaging, particularly digital pathology, could significantly benefit from VLMs for histological interpretation and diagnosis, giving pathologists a complementary tool for faster, more comprehensive reporting and more efficient healthcare services. In this work, we benchmark VLMs on histopathology image understanding. We present an extensive evaluation of recent VLMs on the PathMMU dataset, a domain-specific benchmark that includes subsets such as PubMed, SocialPath, and EduContent. These datasets feature diverse formats, notably multiple-choice questions (MCQs), designed to aid pathologists in diagnostic reasoning and to support professional development initiatives in histopathology. Using VLMEvalKit, a widely used open-source evaluation framework, we bring publicly available pathology datasets under a single evaluation umbrella, ensuring unbiased and contamination-free assessments of model performance. Our study conducts extensive zero-shot evaluations of more than 60 state-of-the-art VLMs, including the LLaVA, Qwen-VL, Qwen2-VL, InternVL, Phi3, Llama3, MOLMO, and XComposer series, significantly expanding the range of evaluated models compared to prior literature. Among the tested models, Qwen2-VL-72B-Instruct achieved superior performance with an average score of 63.97%, outperforming other models across all PathMMU subsets. We conclude that this extensive evaluation will serve as a valuable resource, fostering the development of next-generation VLMs for analyzing digital pathology images. Additionally, we have released the complete evaluation results on our leaderboard PathVLM-Eval: https://huggingface.co/spaces/gilalnauman/PathVLMs.
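As context for the evaluation setup described in the abstract, the sketch below illustrates how a zero-shot VLMEvalKit run over several benchmark subsets might be scripted. It is a minimal sketch, assuming VLMEvalKit is installed with its standard run.py command-line entry point available in the working directory; the model and dataset identifiers are illustrative placeholders (the actual names registered for the PathMMU subsets in a given VLMEvalKit release may differ), not the authors' published configuration.

```python
# Minimal sketch of a zero-shot evaluation loop with VLMEvalKit.
# Assumptions: VLMEvalKit is installed, run.py is in the current directory,
# and the model/dataset identifiers below are placeholders for whatever
# names are registered in the installed VLMEvalKit version.
import subprocess

models = ["Qwen2-VL-72B-Instruct"]  # assumed model identifier
datasets = ["PathMMU_PubMed", "PathMMU_EduContent", "PathMMU_SocialPath"]  # hypothetical subset names

for model in models:
    for data in datasets:
        # run.py --data <benchmark> --model <model> is VLMEvalKit's standard CLI pattern
        subprocess.run(
            ["python", "run.py", "--data", data, "--model", model, "--verbose"],
            check=True,
        )
```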
ISSN:2153-3539