Integrating Abstract Meaning Representation to Enhance Transformer-Based Image Captioning

Bibliographic Details
Main Authors: Nguyen Van Thinh, Tran Lang, Van The Thanh
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11058972/
Description
Summary: Although recent image captioning models have achieved substantial progress, they still encounter limitations in capturing abstract semantics, resulting in insufficient semantic depth and limited diversity in expression. Meanwhile, Abstract Meaning Representation (AMR), a form of abstract semantic representation, has been successfully applied in various natural language processing tasks. However, exploiting AMR in multimodal contexts, particularly for image captioning, remains largely unexplored. To address these limitations, this paper proposes a novel image captioning model within an encoder-decoder framework that leverages the abstract semantics of images through AMR. Specifically, AMR is incorporated into the model in two ways: 1) extracting AMR from ground-truth captions and 2) converting the image’s relational graph into an AMR-like graph to enrich abstract semantics. These AMR embeddings are fused with object-region features and relational-graph embeddings via a cross-modal attention mechanism. Additionally, embeddings from the AMR-like graph are integrated into the Transformer decoder using a masked multi-head attention mechanism to enhance semantic coherence during caption generation. Experimental results on the MS COCO and Flickr30k datasets demonstrate that the proposed model achieves superior captioning accuracy compared to recent state-of-the-art methods, confirming the effectiveness of incorporating AMR in image captioning tasks.
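
The abstract gives no implementation details, so the following is only a minimal, hypothetical PyTorch sketch of the cross-modal attention fusion it describes: AMR-graph node embeddings are attended to by object-region features. The class name CrossModalAMRFusion, the model dimension of 512, the number of heads, and the residual/layer-norm structure are all assumptions for illustration, not values taken from the paper.

    import torch
    import torch.nn as nn


    class CrossModalAMRFusion(nn.Module):
        """Fuse AMR-graph embeddings with object-region features via
        cross-modal attention. Dimensions and layer choices are assumed
        for illustration, not reported by the paper."""

        def __init__(self, d_model: int = 512, num_heads: int = 8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, region_feats: torch.Tensor, amr_embeds: torch.Tensor) -> torch.Tensor:
            # Object regions act as queries; AMR nodes supply keys and values,
            # so each visual region attends to the abstract-semantic graph.
            attended, _ = self.cross_attn(query=region_feats, key=amr_embeds, value=amr_embeds)
            # Residual connection plus layer normalization, as in a standard
            # Transformer encoder block.
            return self.norm(region_feats + attended)


    # Toy usage: a batch of 2 images with 36 region features and 20 AMR nodes.
    fusion = CrossModalAMRFusion()
    regions = torch.randn(2, 36, 512)
    amr_nodes = torch.randn(2, 20, 512)
    fused = fusion(regions, amr_nodes)  # shape: (2, 36, 512)

In this sketch the enriched region features would then be passed on to the decoder; the paper's actual fusion with relational-graph embeddings and its masked multi-head attention in the decoder are not reproduced here.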
ISSN: 2169-3536