Journal of Guangdong University of Technology ›› 2024, Vol. 41 ›› Issue (04): 89-97. doi: 10.12052/gdutxb.230122

• Computer Science and Technology •

Image-Text Matching Algorithm Based on Active Mining of Sample Pair Semantics

Chen Yong-feng, Liu Jing, Yang Zhi-jing, Chen Rui-han, Tan Jun-peng

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2023-08-29; Published: 2024-08-13
  • Corresponding author: Yang Zhi-jing (born 1980), male, professor; his main research interests are computer vision and information retrieval. E-mail: yzhj@gdut.edu.cn
  • About the author: Chen Yong-feng (born 1998), male, master's student; his main research interest is computer vision. E-mail: 2112103118@mail2.gdut.edu.cn
  • Funding:
    Natural Science Foundation of Guangdong Province (2021A1515011341, 2023A1515012561); Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004)

Active Mining Sample Pair Semantics for Image-text Matching

Chen Yong-feng, Liu Jing, Yang Zhi-jing, Chen Rui-han, Tan Jun-peng   

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2023-08-29; Published: 2024-08-13

Abstract: Existing image-text matching algorithms based on common-sense learning struggle to match the hard (intractable) negative samples among image-text sample pairs, generalize poorly, and perform unsatisfactorily on large-scale datasets. To address these shortcomings, this paper proposes an image-text matching model based on active mining of sample pair semantics. First, the proposed adaptive hierarchical reinforcement loss provides diversified learning modes: on top of the conventional triplet loss, predictive candidate instances (hard-to-distinguish sample pairs) are added as auxiliary training signals. Its active learning mode uses a penalty mechanism to focus on hard negative samples, improving the model's discriminative ability. In addition, the model can adaptively mine further hidden, relevant semantic representations from non-ground-truth samples, which improves its performance and generalization ability. Finally, experimental results on the public Flickr30K and MSCOCO datasets demonstrate the effectiveness of the proposed algorithm, whose performance reaches the current state of the art. By effectively combining the image and text modalities, the method can improve the performance of applications such as natural-language search and visual question answering.

Keywords: image-text matching, common-sense learning, triplet loss, intractable negative pairs, cross-modal retrieval

Abstract: Existing image-text matching algorithms based on common-sense learning cannot effectively match the intractable negative samples in image-text sample pairs, and their weak generalization ability makes them ineffective on large-scale datasets. To address these shortcomings, a novel Active Mining Sample Pair Semantics image-text matching model is proposed. Firstly, the proposed Adaptive Hierarchical Reinforcement Loss offers diversified learning modes: on top of the traditional triplet loss, predictive candidate instances (intractable sample pairs) are added to aid training. Its active learning mode enables the model to focus more on intractable negative samples through a penalizing mechanism, enhancing its discriminative ability. In addition, the proposed model can adaptively mine more hidden relevant semantic representations from non-ground-truth samples, which greatly improves its performance and generalization ability. Finally, experimental results on the Flickr30K and MSCOCO datasets show that the proposed method outperforms existing state-of-the-art comparison methods.
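The adaptive hierarchical reinforcement loss itself is not reproduced on this page, but its starting point, a margin-based triplet loss that puts extra weight on the hardest negatives in each mini-batch (cf. VSE++ [9]), can be sketched as follows. The sketch below is an illustrative assumption only: the function and parameter names (ahr_loss_sketch, margin, penalty) are not the authors' implementation, and the extra penalty factor merely mimics the "focus on hard negatives" behaviour described above. In practice such a loss would be applied to the N×N image-text similarity matrix produced by the two encoders for each mini-batch.

import torch

def ahr_loss_sketch(sim: torch.Tensor, margin: float = 0.2, penalty: float = 2.0) -> torch.Tensor:
    # Illustrative sketch, not the authors' Adaptive Hierarchical Reinforcement Loss.
    # sim: (N, N) similarity matrix; sim[i, j] scores image i against text j,
    # and the diagonal entries correspond to the ground-truth (matched) pairs.
    n = sim.size(0)
    pos = sim.diag().view(n, 1)                               # similarities of matched pairs
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Hinge costs for both retrieval directions, with ground-truth pairs masked out.
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image -> text
    cost_i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> image

    # The hardest negative in each row/column receives an extra penalty weight,
    # while the remaining (easier) negatives contribute through a plain sum.
    hard = penalty * (cost_t.max(dim=1).values + cost_i.max(dim=0).values)
    easy = cost_t.sum(dim=1) + cost_i.sum(dim=0)
    return (hard + easy).mean()

# Example usage with L2-normalized embeddings:
#   loss = ahr_loss_sketch(img_emb @ txt_emb.t())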

Key words: image-text matching, common-sense learning, triplet loss, intractable negative pairs, cross-modal retrieval

CLC number:

  • TP391.4
[1] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]//Proceedings of the European Conference on Computer Vision (ECCV) . Munich: Springer International Publishing, 2018: 201-216.
[2] LI K, ZHANG Y, LI K, et al. Visual semantic reasoning for image-text matching[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2019: 4653-4661.
[3] LI M. Research and implementation of image-text matching algorithms incorporating knowledge [D]. Chengdu: University of Electronic Science and Technology of China, 2022.
[4] REN S Y. Research on image-text matching methods based on semantic reasoning [D]. Tianjin: Tiangong University, 2021.
[5] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 2818-2826.
[6] LIN Z H, LI D. Semantics-guided adaptive topology inference graph convolutional networks for skeleton-based action recognition [J]. Journal of Guangdong University of Technology, 2023, 40(4): 45-52.
[7] LU R T, YANG X G, JING X, et al. Infrared small target detection based on local hypergraph dissimilarity measure [J]. IEEE Geoscience and Remote Sensing Letters, 2020, 19: 1-5.
[8] SUN J H, QING C M, TAN J P, et al. Superpoint transformer for 3d scene instance segmentation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Washington: AAAI, 2023, 37(2) : 2393-2401.
[9] FAGHRI F, FLEET D J, KIROS J R, et al. Improving visual-semantic embeddings with hard negatives[J]. arXiv: 1707.05612(2017-12-12) [2022-05-12]. https://arxiv.org/pdf/1707.05612.pdf.
[10] LIN Z G. Research on image-text matching technology based on multi-task learning and attention mechanisms [D]. Tianjin: Tianjin University, 2020.
[11] EISENSCHTAT A, WOLF L. Linking image and text with 2-way nets[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Honolulu: IEEE, 2017: 1855-1865.
[12] KARPATHY A , LI F. Deep visual-semantic alignments for generating image descriptions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Boston: IEEE, 2015: 3128-3137.
[13] WANG J, ZHOU F, WEN S, et al. Deep metric learning with angular loss[C]//2017 IEEE International Conference on Computer Vision (ICCV) . Venice: IEEE, 2017: 2612-2620.
[14] TAN J P, YANG Z J, REN J C, et al. A novel robust low-rank multi-view diversity optimization model with adaptive-weighting based manifold learning [J]. Pattern Recognition, 2022, 122: 108298.
[15] WANG H R, ZHANG Y, JI Z, et al. Consensus-aware visual-semantic embedding for image-text matching[C]//Computer Vision—ECCV 2020: 16th European Conference. Glasgow: Springer International Publishing, 2020: 18-34.
[16] BOUKTHIR K, QAHTANI A M, ALMUTIRY O, et al. Reduced annotation based on deep active learning for arabic text detection in natural scene images [J]. Pattern Recognition Letters, 2022, 157: 42-48.
[17] DU P, CHEN H, ZHAO S Y, et al. Contrastive active learning under class distribution mismatch [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(4): 4260-4273.
[18] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Boston: IEEE, 2015: 4566-4575.
[19] MATSUBARA T. Target-oriented deformation of visual-semantic embedding space [J]. IEICE Transactions on Information and Systems, 2021, 104(1): 24-33.
[20] LIU C, MAO Z, ZANG W, et al. A neighbor-aware approach for image-text matching[C]//ICASSP 2019—2019 IEEE International Conference on Acoustics. Brighton: IEEE, 2019: 3970-3974.
[21] LIU F Y, YE R T, WANG X, et al. HAL: improved text-image matching by mitigating visual semantic hubs[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020, 34(7): 11563-11571.
[22] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077-6086.
[23] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. Advances in Neural Information Processing Systems, 2015, 28: 91-99.
[24] SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks [J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30: 6000-6010.
[26] NAM H, HA J W, KIM J. Dual attention networks for multimodal reasoning and matching[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: IEEE, 2017: 299-307.
[27] CHEN T L, DENG J J, LUO J B. Adaptive offline quintuplet loss for image-text matching[C]//Computer Vision-ECCV 2020: 16th European Conference. Glasgow: Springer International Publishing, 2020: 549-565.
[28] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]//Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore: ACL, 2014: 376-380.
[29] PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models[C]//2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015: 2641-2649.
[30] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: common objects in context[C]//Computer Vision-ECCV 2014: 13th European Conference. Zurich, Switzerland: Springer International Publishing, 2014: 740-755.
[31] BITEN A F, MAFLA A, GOMEZ L, et al. Is an image worth five sentences? A New look into semantics for image-text matching[C]//2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) . Waikoloa: IEEE, 2022: 2483-2492.
[32] HUANG Y, WU Q, SONG C, et al. Learning semantic concepts and order for image and sentence matching[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6163-6171.
[33] WANG Z. CAMP: cross-modal adaptive message passing for text-image retrieval[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2019: 5763-5772.
[34] LIU C X, MAO Z D, LIU A A, et al. Focus your attention: a bidirectional focal attention network for image-text matching[C]//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM, 2019: 3-11.
[35] LIU C, MAO Z, ZHANG T, et al. Graph structured network for image-text matching[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Seattle: IEEE, 2020: 10918-10927.
[36] DIAO H, ZHANG Y, MA L, et al. Similarity reasoning and filtration for image-text matching[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Virtual Event: AAAI, 2021, 35(2): 1218-1226.