Journal of Guangdong University of Technology, 2024, Vol. 41, Issue (04): 89-97. doi: 10.12052/gdutxb.230122

• Computer Science and Technology •

Active Mining Sample Pair Semantics for Image-text Matching

Chen Yong-feng, Liu Jing, Yang Zhi-jing, Chen Rui-han, Tan Jun-peng   

  1. School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
  • Received: 2023-08-29 Published: 2024-08-13

Abstract: Existing image-text matching algorithms based on common-sense learning cannot effectively handle the intractable negative samples in image-text sample pairs, and they generalize poorly and perform weakly on large-scale datasets. To address these shortcomings, a novel image-text matching model, the Active Mining Sample Pair Semantics model, is proposed. First, the proposed Adaptive Hierarchical Reinforcement Loss offers diversified learning modes: on top of the traditional triplet loss, it adds predicted candidate instances (intractable sample pairs) to aid training. Its active learning mode uses a penalizing mechanism to make the model focus more on intractable negative samples, which enhances its discriminative ability. In addition, the model adaptively mines more hidden relevant semantic representations from unannotated items, which greatly improves its performance and generalization ability. Finally, experimental results on the Flickr30K and MSCOCO datasets show that the proposed method outperforms the existing advanced methods it is compared with.
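
For context, the traditional triplet loss that the abstract builds on is commonly computed with the hardest in-batch negatives. The sketch below is a minimal NumPy illustration of that baseline only, not the proposed Adaptive Hierarchical Reinforcement Loss; the function name, the margin value of 0.2, and the layout of the similarity matrix are assumptions made for illustration.

```python
import numpy as np


def hardest_negative_triplet_loss(sim, margin=0.2):
    """Hinge-based triplet loss using the hardest in-batch negative.

    Assumption: sim[i, j] is the similarity between image i and caption j,
    and matched (positive) pairs lie on the diagonal.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    pos = np.diag(sim)                # similarity of each matched pair
    diag = np.eye(n, dtype=bool)      # marks the positive pairs

    # Image-to-text: margin violation of every wrong caption j for image i.
    cost_i2t = np.clip(margin + sim - pos[:, None], 0.0, None)
    # Text-to-image: margin violation of every wrong image i for caption j.
    cost_t2i = np.clip(margin + sim - pos[None, :], 0.0, None)

    # Positive pairs are not negatives, so they contribute no cost.
    cost_i2t[diag] = 0.0
    cost_t2i[diag] = 0.0

    # Keep only the hardest negative per image (row) and per caption (column).
    return cost_i2t.max(axis=1).sum() + cost_t2i.max(axis=0).sum()


# Toy batch of two image-caption pairs (values are made up for illustration).
toy_sim = [[0.9, 0.8],
           [0.1, 0.7]]
toy_loss = hardest_negative_triplet_loss(toy_sim)
```

In this toy batch only the pair (image 0, caption 1) violates the margin, so the loss comes entirely from that hard negative; a loss that also penalizes additional candidate pairs, as the abstract describes, would add further terms on top of this baseline.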

Key words: image-text matching, common-sense learning, triplet loss, intractable negative pairs, cross-modal retrieval

CLC Number: TP391.4