Image Captioning with End-to-end Attribute Detection and Subsequent Attributes Prediction.

Huang, Yiqing; Chen, Jiansheng; Ouyang, Wanli; Wan, Weitao; Xue, Youze

doi:10.1109/TIP.2020.2969330

2020

Abstract

Semantic attention has been shown to be effective in improving the performance of image captioning. The core of semantic attention based methods is to drive the model to attend to semantically important words, or attributes. In previous works, the attribute detector and the captioning network are usually independent, leading to insufficient use of the semantic information. Also, all the detected attributes, whether or not they are appropriate for the linguistic context at the current step, are attended to throughout the whole caption generation process. This may sometimes mislead the captioning model into attending to incorrect visual concepts. To solve these problems, we introduce two end-to-end trainable modules that closely couple attribute detection with image captioning and promote the effective use of attributes by predicting appropriate attributes at each time step. The multimodal attribute detector (MAD) module improves attribute detection accuracy by using not only the image features but also the word embeddings of attributes, which already exist in most captioning models. MAD models the similarity between the semantics of attributes and the image object features to facilitate accurate detection. The subsequent attribute predictor (SAP) module dynamically predicts a concise attribute subset at each time step to mitigate the diversity of image attributes. Compared to previous attribute based methods, our approach enhances the explainability of how the attributes affect the generated words and achieves a state-of-the-art single model performance of 128.8 CIDEr-D on the MSCOCO dataset. Extensive experiments on the MSCOCO dataset show that our proposal improves performance in both image captioning and attribute detection simultaneously. The code is available at: https://github.com/RubickH/Image-Captioning-with-MAD-and-SAP.
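As a rough illustration of the two ideas described in the abstract, the sketch below scores attributes by their similarity to image region features (in the spirit of MAD) and then keeps a small, context-dependent subset for one decoding step (in the spirit of SAP). It is not the authors' implementation: the function names `detect_attributes` and `predict_subset`, the dot-product similarity, the max-pooling over regions, the sigmoid scoring, and the assumption that attribute embeddings, region features, and the decoder state share one vector space are all hypothetical simplifications chosen for the example. The linked repository is the place to look for the actual model.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def detect_attributes(region_feats, attr_embeds):
    """Hypothetical MAD-style detector.

    Each attribute's score is the similarity between its word embedding
    and the best-matching image region, squashed to a probability.
    """
    probs = {}
    for name, emb in attr_embeds.items():
        best = max(dot(emb, region) for region in region_feats)
        probs[name] = sigmoid(best)
    return probs


def predict_subset(probs, attr_embeds, state, k=2):
    """Hypothetical SAP-style predictor.

    Re-weights detected attributes by their relevance to the current
    decoder state and keeps only the top k for this time step.
    """
    scores = {
        name: probs[name] * sigmoid(dot(attr_embeds[name], state))
        for name in probs
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (assumed): two image regions and three candidate attributes.
regions = [[1.0, 0.0], [0.0, 1.0]]
attributes = {"dog": [2.0, 0.0], "grass": [0.0, 2.0], "car": [-2.0, -2.0]}

detected = detect_attributes(regions, attributes)
# A different decoder state at each step selects a different subset.
print(predict_subset(detected, attributes, state=[1.0, -1.0], k=1))
print(predict_subset(detected, attributes, state=[-1.0, 1.0], k=1))
```

In this toy setup the selected attribute changes with the decoder state, which mirrors the abstract's point that only attributes suited to the current linguistic context should be attended to at each step.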

Reference Key	huang2020imageieee
Authors	Huang, Yiqing; Chen, Jiansheng; Ouyang, Wanli; Wan, Weitao; Xue, Youze
Journal	IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society
Year	2020
DOI	10.1109/TIP.2020.2969330
URL	https://doi.org/10.1109/TIP.2020.2969330
