Evaluation of Perceptual Realism and Clinical Plausibility of AI-Generated Colon Polyp Images
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-06-01 |
Series: | Biomedicines |
Subjects: | |
Online Access: | https://www.mdpi.com/2227-9059/13/7/1561 |
Summary: | Background: Synthetic and pseudosynthetic images can be used to extend colonoscopy datasets, which, in turn, are used to train AI-detection models, yet their clinical acceptability depends on whether medical professionals can still recognize non-real content. Aim: To quantify the ability of practicing gastroenterologists to discriminate real, pseudosynthetic, and synthetic polyp images and to determine how training level and synthesis method impact detection. Materials and Methods: A total of 32 Romanian gastroenterologists (18 residents and 14 seniors) reviewed 24 images (8 real, 8 augmented, 4 CycleGAN, and 4 diffusion) via an online form. Classification accuracy, 95% confidence intervals (CI), class sensitivity and precision, 3 × 3 confusion matrices, and Fleiss’ κ were calculated. Resident vs. senior differences were tested with Pearson χ²; CycleGAN versus diffusion detectability was analyzed with the Wilcoxon signed-rank test (α = 0.05). Results: Overall accuracy was 61.2% (95% CI 57.7–64.6%). Residents and seniors performed similarly (62.3% vs. 59.8%; χ²(1) = 0.38, p = 0.54). Sensitivity/precision were 70.7%/62.2% for real, 51.6%/58.9% for augmented, and 61.3%/62.1% for synthetic images. Collapsing to “real vs. non-real” yielded 70.7% sensitivity and 78.5% specificity for real images. CycleGAN images were always recognized as synthetic (128/128; 95% CI 97.1–100%), whereas diffusion images were correctly classified only 22.7% of the time (95% CI 16.3–30.6%; Wilcoxon p < 0.001). The training level did not impact detection performance (χ²(2) < 1.2, p > 0.5). Inter-rater agreement was fair (κ = 0.30, 95% CI 0.15–0.43). Conclusions: Clinicians detect non-real colonoscopy images only slightly above chance, irrespective of experience. The diffusion synthesis method creates images that escape human scrutiny, suggesting the need for automated authenticity safeguards before synthetic datasets are applied in clinical or AI-validation contexts. |
---|---|
ISSN: | 2227-9059 |
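
The summary reports 95% confidence intervals for proportions but does not say how they were computed. The sketch below is one possible reading, assuming a Wilson score interval; the function name `wilson_interval` and its parameters are illustrative and do not come from the article. It uses only the one count stated explicitly in the summary, 128 of 128 CycleGAN ratings (32 raters × 4 images).

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    Assumption: the record does not name the interval method used in the
    article; Wilson is chosen here only as a common default.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    raters = 32           # gastroenterologists, from the summary
    cyclegan_images = 4   # CycleGAN images per rater, from the summary
    ratings = raters * cyclegan_images  # 128 ratings in total
    low, high = wilson_interval(128, ratings)
    print(f"CycleGAN detection: {low:.1%} to {high:.1%}")
```

Under this assumption the lower bound comes out near 97.1%, in line with the interval quoted in the summary. An exact (Clopper-Pearson) interval gives a similar lower bound for this count, so the agreement does not establish which method the authors used.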