A Cross-modal Retrieval Method for Remote Sensing Images Integrating Multi-scale Features and Location Information

Abstract: Accurate cross-modal retrieval between remote sensing images and text depends on effective multimodal feature learning and precise alignment of cross-modal information. However, remote sensing images often exhibit large variations in target scale, so feature learning frequently fails to capture the positional information of image targets accurately and struggles to represent both fine-grained image details and overall semantics. To address these issues, a hybrid network combining a CNN and a Transformer is proposed to strengthen the model's ability to integrate and represent the semantics of both the details and the overall structure of remote sensing images. Within this network, a multi-scale feature optimization module is designed to improve sensitivity to targets of varying scales, and positional information is embedded to reinforce the precise alignment of cross-modal data. In addition, a content-adaptive pooling mechanism is introduced after feature learning in both modalities to better retain important semantic information during feature extraction and to promote cross-modal semantic alignment. Experimental results on the RSICD and RSITMD datasets show that the proposed method outperforms existing mainstream approaches in Recall at K (R@K), significantly improving cross-modal retrieval performance.
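To make the described architecture concrete, below is a minimal sketch in Python/PyTorch of a hybrid CNN + Transformer image encoder with parallel multi-scale convolution branches, a learned positional embedding, and content-adaptive pooling. This is an illustration of the general technique only: all module names, layer sizes, and the fusion scheme are assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class HybridImageEncoder(nn.Module):
    """Illustrative hybrid CNN + Transformer encoder (assumed names/sizes)."""

    def __init__(self, embed_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # CNN stem: local detail features; assumes 224x224 RGB inputs.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Parallel branches with different receptive fields stand in for
        # a multi-scale feature optimization module.
        self.branch3 = nn.Conv2d(128, embed_dim, 3, padding=1)
        self.branch5 = nn.Conv2d(128, embed_dim, 5, padding=2)
        self.fuse = nn.Conv2d(2 * embed_dim, embed_dim, 1)
        # Learned positional embedding injects location information before
        # the Transformer models global structure (224/4 = 56, so 56*56 tokens).
        self.pos_embed = nn.Parameter(torch.zeros(1, 56 * 56, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        # Content-adaptive pooling: a scorer weights tokens by importance.
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, images):                        # (B, 3, 224, 224)
        x = self.stem(images)                         # (B, 128, 56, 56)
        x = self.fuse(torch.cat([self.branch3(x), self.branch5(x)], dim=1))
        tokens = x.flatten(2).transpose(1, 2)         # (B, 3136, D)
        tokens = self.transformer(tokens + self.pos_embed)
        weights = torch.softmax(self.score(tokens), dim=1)
        return (weights * tokens).sum(dim=1)          # (B, D) image embedding

The weighted sum in the last two lines is one common way to realize content-adaptive pooling: instead of mean or max pooling, a learned scorer decides how much each spatial token contributes to the global embedding, which helps keep semantically important regions from being averaged away. A text encoder producing embeddings in the same space would be trained jointly, for example with a contrastive or triplet objective.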

       

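The evaluation metric, Recall at K (R@K), can be computed as sketched below. This assumes embeddings are compared by cosine similarity and that gt_index gives each query's single ground-truth gallery item; that is an illustrative simplification, since datasets such as RSICD and RSITMD pair each image with several captions.

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, gt_index, k=5):
    """Fraction of queries whose ground-truth item appears in the top-K results.

    query_emb: (Q, D) query embeddings; gallery_emb: (G, D) gallery
    embeddings; gt_index: (Q,) index of each query's ground-truth item.
    """
    q = F.normalize(query_emb, dim=1)
    g = F.normalize(gallery_emb, dim=1)
    sims = q @ g.t()                        # cosine similarities, (Q, G)
    topk = sims.topk(k, dim=1).indices      # top-K gallery indices per query
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Example with random data (illustrative only):
# recall_at_k(torch.randn(100, 512), torch.randn(500, 512),
#             torch.randint(0, 500, (100,)), k=5)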