Active Learning for Medical Article Classification with Bag of Words and Bag of Concepts Embeddings

Bibliographic Details
Main Authors: Radosław Pytlak, Paweł Cichosz, Bartłomiej Fajdek, Bogdan Jastrzębski
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/14/7955
Description
Summary: Systems supporting systematic literature reviews often use machine learning algorithms to create classification models that assess the relevance of articles to study topics. The choice of text representation for such algorithms may have a significant impact on their predictive performance. This article presents an in-depth investigation of the utility, for this purpose, of the bag of concepts representation, which can be considered an enhanced form of the ubiquitous bag of words representation, with features corresponding to ontology concepts rather than words. Its utility is evaluated in the active learning setting, in which a sequence of classification models is created, with training data iteratively expanded by adding articles selected for human screening. Different versions of the bag of concepts are compared with bag of words, as well as with combined representations that include both word-based and concept-based features. The evaluation uses the support vector machine, naive Bayes, and random forest algorithms and is performed on datasets from 15 systematic medical literature review studies. The results show that concept-based features may have additional predictive value in comparison to standard word-based features and that the combined bag of concepts and bag of words representation is the most useful overall.
ISSN: 2076-3417
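
Illustration: the active learning setting and the combined word and concept representation described in the summary can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy term-to-concept map (`CONCEPTS`) in place of a real medical ontology, a small hand-written naive Bayes classifier, and uncertainty sampling as the rule for choosing the next article to screen; the record does not state which selection rule the article uses. All names and example texts are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy term-to-concept map standing in for a medical ontology.
CONCEPTS = {"aspirin": "C_NSAID", "ibuprofen": "C_NSAID",
            "stroke": "C_CVD", "infarction": "C_CVD"}

def featurize(text):
    """Combined representation: word counts plus concept counts."""
    words = text.lower().split()
    feats = Counter("w=" + w for w in words)
    feats.update("c=" + CONCEPTS[w] for w in words if w in CONCEPTS)
    return feats

def train_nb(examples):
    """Collect per-class feature counts from [(features, label)] pairs."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for feats, label in examples:
        counts[label].update(feats)
        docs[label] += 1
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab

def prob_relevant(model, feats):
    """Multinomial naive Bayes estimate of P(relevant), add-one smoothing."""
    counts, docs, vocab = model
    total_docs = docs[0] + docs[1]
    log_p = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        lp = math.log((docs[label] + 1) / (total_docs + 2))
        for f, n in feats.items():
            if f in vocab:  # features never seen in training are ignored
                lp += n * math.log((counts[label][f] + 1) / (total + len(vocab)))
        log_p[label] = lp
    m = max(log_p.values())
    e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
    return e1 / (e0 + e1)

def active_learning(pool, oracle, seed_ids, rounds):
    """Retrain each round, then send the most uncertain article to screening."""
    labeled = {i: oracle(i) for i in seed_ids}
    for _ in range(rounds):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        model = train_nb([(featurize(pool[i]), y) for i, y in labeled.items()])
        pick = min(unlabeled,
                   key=lambda i: abs(prob_relevant(model, featurize(pool[i])) - 0.5))
        labeled[pick] = oracle(pick)  # simulated human screening decision
    return labeled

# Invented example pool; 1 = relevant to the review topic, 0 = not relevant.
POOL = ["aspirin reduces stroke risk", "ibuprofen after infarction",
        "aspirin and infarction outcomes", "soil bacteria in wheat fields",
        "wheat yield under drought", "ibuprofen stroke cohort"]
TRUTH = [1, 1, 1, 0, 0, 1]

labeled = active_learning(POOL, lambda i: TRUTH[i], seed_ids=[0, 3], rounds=2)
```

Here the concept features let "ibuprofen" and "infarction" share evidence with "aspirin" and "stroke" even when the exact words never appear in the labeled articles, which is the kind of additional predictive value the summary attributes to concept-based features.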