Abstract:
Accurate cross-modal retrieval of remote sensing images and text depends on effective multimodal feature learning and precise alignment of cross-modal information. However, remote sensing images often exhibit large variations in target scale, making it difficult for feature learning methods to accurately capture positional information and to represent both image details and overall semantics. To address this issue, a hybrid network combining a CNN and a Transformer is proposed to enhance the model's capability to represent both the fine-grained details and the overall structure of remote sensing images. The network incorporates a multi-scale feature optimization module to improve sensitivity to targets of varying scales, and positional information is embedded to support precise alignment of cross-modal data. Additionally, an adaptive pooling mechanism is introduced after feature learning in both modalities to retain critical semantic information, facilitating better cross-modal semantic alignment. Experimental results on the RSICD and RSITMD datasets show that the proposed method outperforms existing mainstream approaches in recall at K (R@K), significantly improving cross-modal retrieval performance.
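For concreteness, the following is a minimal sketch (in PyTorch) of the kind of hybrid CNN-Transformer image encoder with positional embedding and adaptive pooling outlined above. The backbone choice, feature dimensions, token grid size, and the simple average-pooling operator are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class HybridImageEncoder(nn.Module):
    """CNN backbone for local detail + Transformer encoder for global structure,
    followed by pooling into a single image embedding (illustrative sketch only)."""

    def __init__(self, embed_dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        # Assumed backbone: ResNet-50 truncated before the classification head.
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])   # (B, 2048, H/32, W/32)
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)

        # Learnable positional embedding over the feature-map grid (7x7 for 224x224 input).
        self.pos_embed = nn.Parameter(torch.zeros(1, 49, embed_dim))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Placeholder for the paper's adaptive pooling: a simple average pool here.
        self.pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.proj(self.cnn(images))                      # (B, D, h, w)
        tokens = feats.flatten(2).transpose(1, 2)                # (B, h*w, D)
        tokens = tokens + self.pos_embed[:, : tokens.size(1)]    # inject positional information
        tokens = self.transformer(tokens)                        # global context via self-attention
        pooled = self.pool(tokens.transpose(1, 2)).squeeze(-1)   # (B, D)
        return nn.functional.normalize(pooled, dim=-1)           # embedding for cross-modal matching


if __name__ == "__main__":
    encoder = HybridImageEncoder()
    out = encoder(torch.randn(2, 3, 224, 224))
    print(out.shape)  # torch.Size([2, 512])
```

The text branch would produce an embedding in the same space so that image and text vectors can be compared (e.g., by cosine similarity) for retrieval; that branch and the multi-scale optimization module are omitted from this sketch.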