Journal of Guangdong University of Technology ›› 2024, Vol. 41 ›› Issue (04): 89-97. doi: 10.12052/gdutxb.230122

• Computer Science and Technology •

Image-Text Matching Algorithm Based on Active Mining of Sample Pair Semantics

Chen Yong-feng, Liu Jing, Yang Zhi-jing, Chen Rui-han, Tan Jun-peng

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2023-08-29; Published: 2024-08-13
  • Corresponding author: Yang Zhi-jing (born 1980), male, professor; his main research interests are computer vision and information retrieval. E-mail: yzhj@gdut.edu.cn
  • About the author: Chen Yong-feng (born 1998), male, master's student; his main research interest is computer vision. E-mail: 2112103118@mail2.gdut.edu.cn
  • Funding:
    Natural Science Foundation of Guangdong Province (2021A1515011341, 2023A1515012561); Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004)

Active Mining Sample Pair Semantics for Image-text Matching

Chen Yong-feng, Liu Jing, Yang Zhi-jing, Chen Rui-han, Tan Jun-peng   

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2023-08-29; Published: 2024-08-13

Abstract: Existing image-text matching algorithms based on common-sense learning struggle to match the hard (intractable) negative samples among image-text sample pairs, generalize poorly, and perform unsatisfactorily on large-scale datasets. To address these shortcomings, this paper proposes an image-text matching model based on active mining of sample pair semantics. First, the proposed adaptive hierarchical reinforcement loss provides diversified learning modes: on top of the conventional triplet loss, predictive candidate instances (hard-to-distinguish sample pairs) are added as auxiliary training signals. Its active learning mode uses a penalty mechanism to focus on hard negative samples, improving the model's discriminative ability. In addition, the model can adaptively mine further hidden, relevant semantic representations from non-ground-truth samples, which improves its performance and generalization ability. Finally, experimental results on the public Flickr30K and MSCOCO datasets demonstrate the effectiveness of the proposed algorithm, whose performance reaches the current state of the art. By effectively combining the image and text modalities, the method can improve the performance of applications such as natural-language search and visual question answering.

Keywords: image-text matching, common-sense learning, triplet loss, intractable negative pairs, cross-modal retrieval

Abstract: Existing image-text matching algorithms based on common-sense learning cannot effectively match the intractable negative samples in image-text sample pairs, and their weak generalization ability makes them ineffective on large-scale datasets. To address these shortcomings, a novel Active Mining Sample Pair Semantics image-text matching model is proposed. Firstly, the proposed Adaptive Hierarchical Reinforcement Loss offers diversified learning modes: on top of the traditional triplet loss, predictive candidate instances (intractable sample pairs) are added to aid training. Its active learning mode enables the model to focus more on intractable negative samples through a penalizing mechanism, enhancing its discriminative ability. In addition, the proposed model can adaptively mine more hidden relevant semantic representations from non-ground-truth samples, which greatly improves its performance and generalization ability. Finally, experimental results on the Flickr30K and MSCOCO datasets show that the proposed method outperforms existing state-of-the-art comparison methods.
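The adaptive hierarchical reinforcement loss itself is not reproduced on this page, but its starting point, a margin-based triplet loss that puts extra weight on the hardest negatives in each mini-batch (cf. VSE++ [9]), can be sketched as follows. The sketch below is an illustrative assumption only: the function and parameter names (ahr_loss_sketch, margin, penalty) are not the authors' implementation, and the extra penalty factor merely mimics the "focus on hard negatives" behaviour described above. In practice such a loss would be applied to the N×N image-text similarity matrix produced by the two encoders for each mini-batch.

import torch

def ahr_loss_sketch(sim: torch.Tensor, margin: float = 0.2, penalty: float = 2.0) -> torch.Tensor:
    # Illustrative sketch, not the authors' Adaptive Hierarchical Reinforcement Loss.
    # sim: (N, N) similarity matrix; sim[i, j] scores image i against text j,
    # and the diagonal entries correspond to the ground-truth (matched) pairs.
    n = sim.size(0)
    pos = sim.diag().view(n, 1)                               # similarities of matched pairs
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # Hinge costs for both retrieval directions, with ground-truth pairs masked out.
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # image -> text
    cost_i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # text -> image

    # The hardest negative in each row/column receives an extra penalty weight,
    # while the remaining (easier) negatives contribute through a plain sum.
    hard = penalty * (cost_t.max(dim=1).values + cost_i.max(dim=0).values)
    easy = cost_t.sum(dim=1) + cost_i.sum(dim=0)
    return (hard + easy).mean()

# Example usage with L2-normalized embeddings:
#   loss = ahr_loss_sketch(img_emb @ txt_emb.t())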

Key words: image-text matching, common-sense learning, triplet loss, intractable negative pairs, cross-modal retrieval

CLC number:

  • TP391.4
[1] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]//Proceedings of the European Conference on Computer Vision (ECCV) . Munich: Springer International Publishing, 2018: 201-216.
[2] LI K, ZHANG Y, LI K, et al. Visual semantic reasoning for image-text matching[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2019: 4653-4661.
[3] LI M. Research and implementation of image-text matching algorithms incorporating knowledge [D]. Chengdu: University of Electronic Science and Technology of China, 2022.
[4] REN S Y. Research on image-text matching methods based on semantic reasoning [D]. Tianjin: Tiangong University, 2021.
[5] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 2818-2826.
[6] LIN Z H, LI D. Semantics-guided adaptive topology inference graph convolutional networks for skeleton-based action recognition [J]. Journal of Guangdong University of Technology, 2023, 40(4): 45-52.
[7] LU R T, YANG X G, JING X, et al. Infrared small target detection based on local hypergraph dissimilarity measure [J]. IEEE Geoscience and Remote Sensing Letters, 2020, 19: 1-5.
[8] SUN J H, QING C M, TAN J P, et al. Superpoint transformer for 3d scene instance segmentation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Washington: AAAI, 2023, 37(2) : 2393-2401.
[9] FAGHRI F, FLEET D J, KIROS J R, et al. Improving visual-semantic embeddings with hard negatives[J]. arXiv: 1707.05612(2017-12-12) [2022-05-12]. https://arxiv.org/pdf/1707.05612.pdf.
[10] LIN Z G. Research on image-text matching technology based on multi-task learning and attention mechanisms [D]. Tianjin: Tianjin University, 2020.
[11] EISENSCHTAT A, WOLF L. Linking image and text with 2-way nets[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Honolulu: IEEE, 2017: 1855-1865.
[12] KARPATHY A , LI F. Deep visual-semantic alignments for generating image descriptions[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Boston: IEEE, 2015: 3128-3137.
[13] WANG J, ZHOU F, WEN S, et al. Deep metric learning with angular loss[C]//2017 IEEE International Conference on Computer Vision (ICCV) . Venice: IEEE, 2017: 2612-2620.
[14] TAN J P, YANG Z J, REN J C, et al. A novel robust low-rank multi-view diversity optimization model with adaptive-weighting based manifold learning [J]. Pattern Recognition, 2022, 122: 108298.
[15] WANG H R, ZHANG Y, JI Z, et al. Consensus-aware visual-semantic embedding for image-text matching[C]//Computer Vision—ECCV 2020: 16th European Conference. Glasgow: Springer International Publishing, 2020: 18-34.
[16] BOUKTHIR K, QAHTANI A M, ALMUTIRY O, et al. Reduced annotation based on deep active learning for arabic text detection in natural scene images [J]. Pattern Recognition Letters, 2022, 157: 42-48.
[17] DU P, CHEN H, ZHAO S Y, et al. Contrastive active learning under class distribution mismatch [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(4): 4260-4273.
[18] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Boston: IEEE, 2015: 4566-4575.
[19] MATSUBARA T. Target-oriented deformation of visual-semantic embedding space [J]. IEICE Transactions on Information and Systems, 2021, 104(1): 24-33.
[20] LIU C, MAO Z, ZANG W, et al. A neighbor-aware approach for image-text matching[C]//ICASSP 2019—2019 IEEE International Conference on Acoustics. Brighton: IEEE, 2019: 3970-3974.
[21] LIU F Y, YE R T, WANG X, et al. HAL: improved text-image matching by mitigating visual semantic hubs[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020, 34(7): 11563-11571.
[22] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6077-6086.
[23] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. Advances in Neural Information Processing Systems, 2015, 28: 91-99.
[24] SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks [J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30: 6000-6010.
[26] NAM H, HA J W, KIM J. Dual attention networks for multimodal reasoning and matching[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Hawaii: IEEE, 2017: 299-307.
[27] CHEN T L, DENG J J, LUO J B. Adaptive offline quintuplet loss for image-text matching[C]//Computer Vision-ECCV 2020: 16th European Conference. Glasgow: Springer International Publishing, 2020: 549-565.
[28] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]//Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore: ACL, 2014: 376-380.
[29] PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models[C]//2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE, 2015: 2641-2649.
[30] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft coco: common objects in context[C]//Computer Vision-ECCV 2014: 13th European Conference. Zurich, Switzerland: Springer International Publishing, 2014: 740-755.
[31] BITEN A F, MAFLA A, GOMEZ L, et al. Is an image worth five sentences? A New look into semantics for image-text matching[C]//2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) . Waikoloa: IEEE, 2022: 2483-2492.
[32] HUANG Y, WU Q, SONG C, et al. Learning semantic concepts and order for image and sentence matching[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6163-6171.
[33] WANG Z. CAMP: cross-modal adaptive message passing for text-image retrieval[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South): IEEE, 2019: 5763-5772.
[34] LIU C X, MAO Z D, LIU A A, et al. Focus your attention: a bidirectional focal attention network for image-text matching[C]//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM, 2019: 3-11.
[35] LIU C, MAO Z, ZHANG T, et al. Graph structured network for image-text matching[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Seattle: IEEE, 2020: 10918-10927.
[36] DIAO H, ZHANG Y, MA L, et al. Similarity reasoning and filtration for image-text matching[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Virtual Event: AAAI, 2021, 35(2): 1218-1226.