Vision-Language Hash Learning for Remote Sensing Scene Retrieval Based on Asymmetric Semantic Representation Mining

Ganav Hishra; Kasper Kennedy; Bennett A. Carpenter; Kasper Burton

Authors

Ganav Hishra, Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA.
Kasper Kennedy, Department of Computer Science, University of Alabama at Birmingham, Birmingham, AL, USA.
Bennett A. Carpenter, Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA.
Kasper Burton, School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA.

Keywords:

remote sensing scene retrieval, vision-language hashing, asymmetric semantic mining, deep learning, cross-modal retrieval, large-scale systems, robustness, fairness, governance

Abstract

The exponential growth of remote sensing imagery archives demands scalable, semantically precise retrieval mechanisms that bridge the gap between high-dimensional visual data and queries expressed in natural language. Vision-language hash learning has emerged as a compelling paradigm: it encodes cross-modal semantic correspondences into compact binary codes that support fast approximate nearest neighbor search in large-scale repositories. This paper presents a systems-oriented examination of vision-language hash learning for remote sensing scene retrieval, grounded in asymmetric semantic representation mining. Conventional symmetric alignment strategies often fail to account for the inherent imbalance in information density between satellite imagery and textual descriptions: visual scenes contain rich spectral and spatial detail that short query statements capture only partially. We argue that intentionally asymmetric representations, in which the visual and language encoders learn complementary rather than strictly matched embeddings, improve retrieval fidelity when combined with carefully designed hash coding. The paper foregrounds the architectural trade-offs, infrastructure requirements, deployment models, and governance frameworks that shape the real-world viability of such systems. We discuss how decisions about model complexity, hash code length, training data composition, and inference distribution affect sustainability, fairness, and robustness. Cross-domain comparisons with multimedia and medical image retrieval highlight challenges specific to the remote sensing domain, including geospatial bias, temporal variability, and the coexistence of heterogeneous sensor modalities. Policy considerations around data sovereignty, dual-use governance, and the environmental footprint of large-scale multi-modal training are integrated into a holistic assessment. The paper concludes by identifying open research frontiers at the intersection of asymmetric learning, hash-based indexing, and socio-technical infrastructure design.
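
To make the retrieval mechanism named in the abstract concrete, the following is a minimal sketch of text-to-image lookup over binary hash codes. It is not the authors' method: it assumes sign-based binarization and exhaustive Hamming ranking, and it does not model the asymmetric representation mining itself. The function names (`binarize`, `hamming`, `retrieve`) and the toy embedding values are hypothetical, standing in for the outputs of a visual encoder and a language encoder.

```python
from typing import List, Sequence, Tuple


def binarize(embedding: Sequence[float]) -> int:
    """Pack the signs of a real-valued embedding into an integer hash code."""
    code = 0
    for value in embedding:
        # One bit per dimension: 1 for non-negative values, 0 otherwise.
        code = (code << 1) | (1 if value >= 0.0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two hash codes."""
    return bin(a ^ b).count("1")


def retrieve(query_code: int, database: Sequence[int], top_k: int) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs for the top_k closest database codes."""
    scored = [(idx, hamming(query_code, code)) for idx, code in enumerate(database)]
    # Sort by distance, breaking ties by database index for determinism.
    scored.sort(key=lambda pair: (pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical 8-dimensional encoder outputs (toy values, not real data).
image_embeddings = [
    [0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8],
    [-0.6, 0.5, -0.1, 0.2, -0.9, 0.4, 0.7, -0.3],
    [0.8, -0.1, 0.3, -0.4, -0.2, 0.6, -0.9, 0.5],
]
text_embedding = [0.7, -0.3, 0.2, -0.6, 0.2, 0.1, -0.4, 0.9]

image_codes = [binarize(e) for e in image_embeddings]
query_code = binarize(text_embedding)
results = retrieve(query_code, image_codes, top_k=2)
print(results)
```

Because each comparison is an XOR followed by a bit count, the cost per database item is small and independent of the original embedding precision. A deployed system would typically replace the exhaustive scan with a Hamming-space index; that choice is outside the scope of this sketch.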

