Evidence-Grounded Chest X-ray Report Generation with Retrieval, Citation, and Hallucination Control
DOI:
https://doi.org/10.57214/jusika.v7i2.1127
Keywords:
chest X-ray, evidence grounding, hallucination control, radiology report generation, retrieval-augmented generation
Abstract
Chest X-ray report generation has become an important topic in vision-to-language research. However, fully generative models often produce fluent clinical reports that may contain unsupported or inaccurate statements (hallucinations), which reduces their reliability. This study investigates evidence-grounded report generation on the Open-i (IU X-ray) dataset with two main goals: generating coherent radiology reports from X-ray images and minimizing unsupported clinical entities through evidence retrieval. Four experimental models were evaluated: a baseline image-to-report model (E1), an alignment-enhanced model trained with the InfoNCE objective (E2), a retrieval-grounded model that incorporates Top-K evidence sentences with citation markers such as [E#] (E3), and a reranking model that selects the most evidence-supported output (E4). Experimental results on the Open-i test set show that grounding methods significantly reduce hallucination rates and improve entity overlap. The reranking approach achieves the best grounding quality, although stronger grounding slightly lowers text-overlap scores and increases inference time. Overall, retrieval-based grounding with explicit citations and reranking is an effective way to improve factual consistency and reduce unsupported information in automated radiology report generation.
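The retrieval, citation, and reranking steps named in the abstract (E3 and E4) can be illustrated with a minimal sketch. It is not the authors' implementation: every function name, the toy entity vocabulary, the two-dimensional embeddings, and the definition of hallucination rate used here (the fraction of entities in a report that appear in no retrieved evidence sentence) are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Set, Tuple

# Hypothetical toy vocabulary; a real system would use a clinical entity extractor.
ENTITY_VOCAB = ("cardiomegaly", "effusion", "pneumothorax", "consolidation", "edema")


def extract_entities(text: str) -> Set[str]:
    """Return vocabulary entities mentioned in the text (substring match, ignores negation)."""
    lowered = text.lower()
    return {entity for entity in ENTITY_VOCAB if entity in lowered}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: Sequence[float],
                   corpus: List[Tuple[str, Sequence[float]]],
                   k: int) -> List[str]:
    """Return the k corpus sentences whose embeddings are most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query, item[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]


def format_evidence(sentences: List[str]) -> List[str]:
    """Prefix each evidence sentence with a citation marker [E1], [E2], ..."""
    return [f"[E{i}] {s}" for i, s in enumerate(sentences, start=1)]


def hallucination_rate(report: str, evidence: List[str]) -> float:
    """Fraction of report entities not found in any evidence sentence (assumed definition)."""
    entities = extract_entities(report)
    if not entities:
        return 0.0
    supported = set().union(*(extract_entities(s) for s in evidence))
    return len(entities - supported) / len(entities)


def rerank(candidates: List[str], evidence: List[str]) -> str:
    """Select the candidate report with the lowest hallucination rate."""
    return min(candidates, key=lambda c: hallucination_rate(c, evidence))


# Toy evidence corpus: (sentence, embedding). The embeddings are made up.
corpus = [
    ("Mild cardiomegaly is present.", [1.0, 0.0]),
    ("Small left pleural effusion.", [0.8, 0.6]),
    ("No pneumothorax.", [0.0, 1.0]),
]
image_embedding = [1.0, 0.2]  # stand-in for an image encoder output

evidence = retrieve_top_k(image_embedding, corpus, k=2)
cited_evidence = format_evidence(evidence)

candidates = [
    "Cardiomegaly with pulmonary edema.",
    "Cardiomegaly with small effusion.",
]
best_report = rerank(candidates, evidence)
```

In this toy run the second candidate is selected because both of its entities occur in the retrieved evidence, whereas "edema" in the first candidate is unsupported. A practical system would also need to handle negation and synonymy, which simple substring matching does not.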
License
Copyright (c) 2023 Jurnal Sains dan Kesehatan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.






