Academic case reports lack diversity: Assessing the presence and diversity of sociodemographic and behavioral factors related to Post COVID-19 Condition.

Bibliographic Details
Main Authors: Juan Andres Medina Florez, Shaina Raza, Rashida Lynn Ansell, Zahra Shakeri, Brendan T Smith, Elham Dolatabadi
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0326668
collection DOAJ
description Understanding disparities in the prevalence of Post COVID-19 Condition (PCC) among vulnerable populations is crucial to improving care and addressing intersecting inequities. This study aims to develop a comprehensive framework for integrating social determinants of health (SDOH) into PCC research by using natural language processing (NLP) techniques to analyze disparities and variations in SDOH representation within PCC case reports. Following construction of a PCC Case Report Corpus comprising over 7,000 case reports from the LitCOVID repository, a subset of 709 reports was annotated with 26 core SDOH-related entity types using pre-trained named entity recognition (NER) models, human review, and data augmentation to improve the quality, diversity, and representation of entity types. An NLP pipeline integrating NER, natural language inference (NLI), and trigram and frequency analyses was developed to extract and analyze these entities. Both encoder-only transformer models and RNN-based models were assessed for the NER objective. Fine-tuned encoder-only BERT models outperformed traditional RNN-based models in generalizing to distinct sentence structures and to greater class sparsity, achieving a macro F1-score of 0.72 and a macro AUC of 0.99 on a held-out generalization set. Exploratory analysis revealed variability in entity richness, with prevalent entities such as condition, age, and access to care, and under-representation of sensitive categories such as race and housing status. Trigram analysis highlighted frequent co-occurrences among entities, including age, gender, and condition. The NLI objective (entailment and contradiction analysis) showed that attributes such as "Experienced violence or abuse" and "Has medical insurance" had high entailment rates (80.3%-82.4%), while attributes such as "Is female-identifying," "Is married," and "Has a terminal condition" exhibited high contradiction rates (70.8%-98.5%).
Our results highlight the effectiveness of transformer-based NER in extracting SDOH information from case reports. However, the findings also expose critical gaps in the representation of marginalized groups within PCC-related academic case reports, e.g., across gender, insurance status, and age. This work underscores the need for standardized SDOH documentation and inclusive reporting practices to enable more equitable research and to inform future health policy and AI model development.
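The abstract reports a macro F1-score for NER and a trigram analysis of entity co-occurrence. The following is only a minimal sketch of those two ideas and is not the authors' pipeline: the function names (macro_f1, entity_trigrams), the label strings, and the choice to count trigrams over per-report sequences of entity types are all assumptions made for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def entity_trigrams(entity_sequences):
    """Count consecutive triples of entity types within each sequence.

    Assumption: each inner list holds the entity types found in one
    report, in order of appearance.
    """
    counts = Counter()
    for seq in entity_sequences:
        for i in range(len(seq) - 2):
            counts[tuple(seq[i:i + 3])] += 1
    return counts


# Hypothetical toy labels, not data from the study.
gold = ["AGE", "AGE", "GENDER", "CONDITION"]
pred = ["AGE", "GENDER", "GENDER", "CONDITION"]
score = macro_f1(gold, pred)

reports = [
    ["AGE", "GENDER", "CONDITION", "ACCESS_TO_CARE"],
    ["AGE", "GENDER", "CONDITION"],
]
trigram_counts = entity_trigrams(reports)
```

Because macro averaging weights every class equally, a rare entity type that is poorly recognized lowers the score as much as a common one would, which is why it is a natural metric when classes are sparse.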
id doaj-art-dfb92c8eb9a04b15aa4dfc23c38b600f
institution Matheson Library
issn 1932-6203
citation PLoS ONE 20(7): e0326668 (2025-01-01). https://doi.org/10.1371/journal.pone.0326668