广东工业大学学报

基于跨模态差异注意力的医学报告生成

陈嘉鸿, 黄国恒, 谭喆   

  1. 广东工业大学 计算机学院,广东 广州 510006
  • 收稿日期:2024-01-02 出版日期:2024-09-27 发布日期:2024-09-27
  • 通信作者: 黄国恒(1985–),男,副教授,博士,主要研究方向为计算机视觉、模式识别和人工智能,E-mail:kevinwong@gdut.edu.cn
  • 作者简介:陈嘉鸿(1998–),男,硕士研究生,主要研究方向为医学报告生成、图像描述,E-mail:jhchan003@gmail.com
  • 基金资助:
    广东省重点领域研发计划项目(2019B010153002);佛山市重点领域科技攻关项目(2020001006832)

Cross-modal Discrepancy Attention Network for Medical Report Generation

Chen Jia-hong, Huang Guo-heng, Tan Zhe   

  1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
  • Received:2024-01-02 Online:2024-09-27 Published:2024-09-27

摘要: 医学报告自动生成技术对辅助诊断起着重要作用,能够极大减轻医护工作者的工作量。随着深度学习在医学领域的不断发展,医学报告自动生成已成为智慧医疗领域的研究热点之一。目前,医学报告生成面临的主要挑战是:图像中的病灶区域难以被模型捕捉,且视觉语义与语言语义之间存在较大的语义鸿沟,两者的一致性问题仍未得到很好的解决。为此,本文提出跨模态差异注意力网络以拉近不同模态之间的语义。该网络包括反向注意力模块和语义一致模块:反向注意力模块更全面地探索医学图像中的重要区域;语义一致模块利用大语言模型的特征作为参考,引导视觉特征不断靠近参考文本特征,使视觉语义更准确地转化为一致的语言语义。实验表明,跨模态差异注意力网络在IU X-Ray和MIMIC-CXR两个公开数据集上的表现均优于以往模型,BLEU4分数分别达到17.9%和10.9%;相比基线模型,本文模型性能有较大提升,证明所提模型能够生成准确且流畅的医学报告。
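The reverse-attention idea described above can be sketched as follows: alongside the usual attention readout, form the complement of the attention map so the model also pools features from under-attended image regions. This is a minimal NumPy illustration assuming standard scaled dot-product attention; the function name, shapes, and the renormalization step are illustrative assumptions, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reverse_attention(query, key, value):
    """Return the forward attention readout and a 'reverse' readout that
    pools features from the regions the forward attention down-weighted."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)          # (Nq, Nk) similarity scores
    attn = softmax(scores)                        # forward attention weights
    rev = 1.0 - attn                              # complement: under-attended regions
    rev = rev / rev.sum(axis=-1, keepdims=True)   # renormalize to a distribution
    return attn @ value, rev @ value

# Toy example: one query vector over 49 spatial regions of a feature map.
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((49, 64))
v = rng.standard_normal((49, 64))
fwd, bwd = reverse_attention(q, k, v)
print(fwd.shape, bwd.shape)  # (1, 64) (1, 64)
```

Combining `fwd` and `bwd` (e.g. by concatenation or summation) gives the decoder a view of both salient and otherwise-ignored regions, which is the intuition behind exploring lesion areas more comprehensively.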

关键词: 医学报告生成, 语义一致, 注意力机制

Abstract: Automatic medical report generation plays an important role in auxiliary diagnosis and can greatly reduce the workload of medical workers. As deep learning continues to develop in the medical field, automatic medical report generation has become one of the research hotspots in smart healthcare. Currently, the main challenges in medical report generation are (1) the difficulty for models to capture lesion regions in images, and (2) the large semantic gap between visual and language semantics, whose consistency remains poorly resolved. To address these problems, a Cross-Modal Discrepancy Attention Network (CDAN) is proposed to bring the semantics of different modalities closer together. The network includes a Reverse Attention (RA) module and a Semantic Consistency (SC) module: (1) the Reverse Attention module explores important areas in medical images more comprehensively, and (2) the Semantic Consistency module uses the features of a large language model as a reference to guide the visual features toward the reference language features, so that visual semantics can be converted into consistent language semantics more accurately. Experiments show that CDAN outperforms previous models on both the IU X-Ray and MIMIC-CXR public datasets, with BLEU4 scores reaching 17.9% and 10.9% respectively. Compared with the baseline model, the performance improvement is significant, demonstrating that the proposed model can generate accurate and fluent medical reports.
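The semantic consistency objective, pulling visual features toward reference text features produced by a language model, can be illustrated with a simple cosine-distance loss. This is a hedged sketch only: `semantic_consistency_loss` is a hypothetical name, and the paper's actual loss, projection layers, and feature dimensions may differ.

```python
import numpy as np

def semantic_consistency_loss(visual, text):
    """Mean (1 - cosine similarity) over paired visual and reference text
    features; minimizing it pulls visual semantics toward the text semantics."""
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    t = text / np.linalg.norm(text, axis=-1, keepdims=True)
    cos = (v * t).sum(axis=-1)          # per-pair cosine similarity in [-1, 1]
    return float((1.0 - cos).mean())    # 0 when aligned, up to 2 when opposed

rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 128))   # 4 paired feature vectors
print(semantic_consistency_loss(feats, feats))   # ~0.0 for identical pairs
print(semantic_consistency_loss(feats, -feats))  # ~2.0 for opposite pairs
```

In training, `text` would come from a frozen language-model encoder over the reference report, so the gradient only moves the visual branch toward the fixed language semantics.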

Key words: medical report generation, semantic consistency, attention mechanism
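For readers unfamiliar with the reported metric, BLEU-4 combines clipped n-gram precisions (n = 1..4) with a brevity penalty [35]. The sentence-level sketch below is for illustration only; the scores reported above use the standard corpus-level computation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (Papineni et al., 2002)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

cand = "the heart size is normal".split()
ref = "the heart size is normal".split()
print(round(bleu(cand, ref), 3))  # 1.0 for an exact match
```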

中图分类号: 

  • TP391.4
[1] 李小雷. 人工智能在医学影像图像处理中的研究进展[J]. 中国医学计算机成像杂志, 2023, 29(4): 454-457.
LI X L. Advances in the application of artificial intelligence in medical image processing [J]. Chinese Computed Medical Imaging, 2023, 29(4): 454-457.
[2] 丛超. 基于多模态医学影像智能分析的深度学习算法研究与应用 [D]. 重庆: 中国人民解放军陆军军医大学, 2023.
[3] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2015: 3156-3164.
[4] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//International Conference on Machine Learning. San Diego: ACM, 2015: 2048-2057.
[5] CHEN Z, SONG Y, CHANG T H, et al. Generating radiology reports via memory-driven transformer[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg: ACL, 2020: 1439-1449.
[6] JING B, XIE P, XING E. On the automatic generation of medical imaging reports[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2018: 2577-2586.
[7] LI Y, LIANG X, HU Z, et al. Hybrid retrieval-generation reinforced agent for medical image report generation[C]//Advances in Neural Information Processing Systems. California: NIPS, 2018: 1537-1547.
[8] LIU G, HSU T M H, MCDERMOTT M, et al. Clinically accurate chest x-ray report generation[C]//Machine Learning for Healthcare Conference. New York: PMLR, 2019: 249-269.
[9] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 7008-7024.
[10] LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 375-383.
[11] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 6077-6086.
[12] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images[C]//Computer Vision–ECCV 2010: 11th European Conference on Computer Vision. Berlin Heidelberg: Springer, 2010: 15-29.
[13] MITCHELL M, DODGE J, GOYAL A, et al. Midge: generating image descriptions from computer vision detections[C]//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg: ACL, 2012: 747-756.
[14] GONG Y, WANG L, HODOSH M, et al. Improving image-sentence embeddings using large weakly annotated photo collections[C]//Computer Vision–ECCV 2014: 13th European Conference. Berlin Heidelberg: Springer, 2014: 529-545.
[15] GUPTA A, VERMA Y, JAWAHAR C. Choosing linguistics over vision to describe images[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Menlo Park: AAAI, 2012, 26(1): 606-612.
[16] KRAUSE J, JOHNSON J, KRISHNA R, et al. A hierarchical approach for generating descriptive image paragraphs[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 317-325.
[17] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. La Jolla: NIPS, 2017, 30.
[18] YU J, LI J, YU Z, et al. Multimodal transformer with multi-view visual representation for image captioning [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(12): 4467-4480.
[19] LI Y, LIANG X, HU Z, et al. Hybrid retrieval-generation reinforced agent for medical image report generation[C]//Advances in Neural Information Processing Systems. La Jolla: NIPS, 2018, 31.
[20] WANG X, PENG Y, LU L, et al. Tienet: text-image embedding network for common thorax disease classification and reporting in chest x-rays[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 9049-9058.
[21] YIN C, QIAN B, WEI J, et al. Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network[C]//2019 IEEE International Conference on Data Mining (ICDM). New York: IEEE, 2019: 728-737.
[22] LI C Y, LIANG X, HU Z, et al. Knowledge-driven encode, retrieve, paraphrase for medical image report generation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2019, 33: 6666-6673.
[23] ZHANG Y, WANG X, XU Z, et al. When radiology report generation meets knowledge graph[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2020, 34: 12910-12917.
[24] YOU D, LIU F, GE S, et al. Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report generation[C]//Medical Image Computing and Computer Assisted Intervention-MICCAI 2021. Berlin Heidelberg: Springer, 2021: 72-82.
[25] IRVIN J, RAJPURKAR P, KO M, et al. Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI, 2019, 33(1): 590-597.
[26] ALSENTZER E, MURPHY J, BOAG W, et al. Publicly available clinical BERT embeddings[C]//Proceedings of the 2nd Clinical Natural Language Processing Workshop. Stroudsburg: ACL, 2019: 72-78.
[27] DEMNER-FUSHMAN D, KOHLI M D, ROSENMAN M B, et al. Preparing a collection of radiology examinations for distribution and retrieval [J]. Journal of the American Medical Informatics Association, 2016, 23: 304-310.
[28] JOHNSON A E W, POLLARD T J, GREENBAUM N R, et al. MIMIC-CXR: a large publicly available database of labeled chest radiographs[EB/OL]. arXiv: 1901.07042 (2019-11-14) [2024-03-18]. https://arxiv.org/abs/1901.07042.
[29] JING B, WANG Z, XING E. Show, describe and conclude: on exploiting the structure information of chest x-ray reports[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6570-6580.
[30] LIU F, GE S, WU X. Competence-based multimodal curriculum learning for medical report generation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 3001-3012.
[31] CHEN Z, SHEN Y, SONG Y, et al. Cross-modal memory networks for radiology report generation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 5904-5914.
[32] LIU F, WU X, GE S, et al. Exploring and distilling posterior and prior knowledge for radiology report generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2021: 13753-13762.
[33] 谭立玮, 张淑军, 韩琪, 等. 面向医学影像报告生成的门归一化编解码网络[J]. 智能系统学报, 2024, 19: 411-419.
TAN L W, ZHANG S J, HAN Q, et al. Gate normalized encoder-decoder network for medical image report generation [J]. CAAI Transactions on Intelligent Systems, 2024, 19: 411-419.
[34] WANG J, BHALERAO A, HE Y. Cross-modal prototype driven network for radiology report generation[C]//European Conference on Computer Vision. Cham: Springer, 2022: 563-579.
[35] PAPINENI K, ROUKOS S, WARD T, et al. Bleu: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Somerset: ACL, 2002: 311-318.
[36] DENKOWSKI M, LAVIE A. Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems[C]//Proceedings of the Sixth Workshop on Statistical Machine Translation. Stroudsburg: ACL, 2011: 85-91.
[37] LIN C Y. Rouge: a package for automatic evaluation of summaries[C]//Text Summarization Branches Out. Barcelona: ACL, 2004: 74-81.