Abstract:
Automatic medical report generation plays an important role in auxiliary diagnosis and can greatly reduce the workload of medical staff. With the continued development of deep learning in the medical field, it has become a research hotspot. The main challenges at present are (1) that models have difficulty capturing lesion regions in images, and (2) that a large semantic gap separates the visual and language modalities, so that consistency between them is still not well solved. To address these problems, a Cross-Modal Discrepancy Attention Network (CDAN) is proposed to narrow the semantic gap between modalities. The network comprises a Reverse Attention (RA) module and a Semantic Consistency (SC) module: (1) the RA module explores important regions in medical images more comprehensively, and (2) the SC module uses features from a large language model as a reference and guides the visual features toward these reference language features, so that visual semantics can be converted into language semantics more accurately. Experiments show that CDAN outperforms previous models on both the IU X-Ray and MIMIC-CXR public datasets, with BLEU-4 scores of 17.9% and 10.9%, respectively. It also improves significantly over the baseline model, indicating that the proposed model can generate accurate and fluent medical reports.
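
The two module ideas can be illustrated with a minimal sketch. The abstract does not give the exact formulation of either module, so everything below is an assumption made for illustration: reverse attention is taken to be the renormalized complement of softmax attention over image regions, and semantic consistency is taken to be a cosine distance between a visual feature vector and a reference language feature vector. The function names `reverse_attention` and `cosine_distance` are hypothetical and are not taken from CDAN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of region scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_attention(scores):
    """Assumed form of reverse attention: complement of softmax attention.

    Regions that ordinary attention down-weights receive more weight here,
    so that less salient regions are also explored. Weights sum to 1.
    """
    if len(scores) < 2:
        raise ValueError("need at least two regions to renormalize")
    attn = softmax(scores)
    complement = [1.0 - a for a in attn]
    total = sum(complement)  # equals len(scores) - 1
    return [c / total for c in complement]


def cosine_distance(visual, reference):
    """Assumed consistency loss: 1 - cosine similarity.

    The value shrinks toward 0 as the visual feature aligns with the
    reference language feature.
    """
    dot = sum(a * b for a, b in zip(visual, reference))
    norm_v = math.sqrt(sum(a * a for a in visual))
    norm_r = math.sqrt(sum(b * b for b in reference))
    return 1.0 - dot / (norm_v * norm_r)


# Toy usage with made-up numbers (not data from the paper).
weights = reverse_attention([2.0, 0.0, -1.0])
loss = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

In a training setup of this kind, a consistency term like `cosine_distance` would typically be added to the report generation loss so that minimizing it pulls the visual features toward the language reference; whether CDAN uses this particular distance is not stated in the abstract.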