Evidence-Grounded Chest X-ray Report Generation with Retrieval, Citation, and Hallucination Control

Authors

  • Danang Danang, Universitas Sains dan Teknologi Komputer
  • Toni Wijanarko Adi Putra, Universitas Sains dan Teknologi Komputer

DOI:

https://doi.org/10.57214/jusika.v7i2.1127

Keywords:

chest X-ray, evidence grounding, hallucination control, radiology report generation, retrieval-augmented generation

Abstract

Chest X-ray report generation has become an important topic in vision-to-language research. However, fully generative models often create fluent clinical reports that may contain unsupported or inaccurate statements, leading to hallucination problems and reducing reliability. This study investigates evidence-grounded report generation using the Open-i (IU X-ray) dataset with two main goals: generating coherent radiology reports from X-ray images and minimizing unsupported clinical entities through evidence retrieval. Four experimental models were evaluated: a baseline image-to-report model (E1), an alignment-enhanced model using the InfoNCE objective (E2), a retrieval-grounded model that incorporates Top-K evidence sentences with citation markers such as [E#] (E3), and a reranking model that selects the most evidence-supported output (E4). Experimental results on the Open-i test dataset show that grounding methods significantly reduce hallucination rates and improve entity overlap performance. The reranking approach achieves the best grounding quality, although stronger grounding slightly lowers text-overlap scores and increases inference time. Overall, retrieval-based grounding with explicit citations and reranking offers an effective approach for improving factual consistency and reducing unsupported information in automated radiology report generation.
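
The grounding steps named in the abstract (Top-K evidence retrieval with [E#] markers, an unsupported-entity rate, and reranking by evidence support) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the text query stands in for an image embedding, bag-of-words cosine similarity stands in for the learned retriever, and ENTITY_TERMS, retrieve_top_k, hallucination_rate, and rerank are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy entity vocabulary (assumption); a real system would use
# a clinical entity extractor. Negation ("no effusion") is not handled here.
ENTITY_TERMS = {"effusion", "pneumothorax", "cardiomegaly", "consolidation", "edema"}


def tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip(".,;:[]()").lower() for t in text.split()]


def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query, corpus, k=2):
    """Return the k most similar corpus sentences, tagged [E1]..[Ek]."""
    ranked = sorted(corpus, key=lambda s: cosine(query, s), reverse=True)[:k]
    return [(f"[E{i}]", s) for i, s in enumerate(ranked, start=1)]


def entities(text):
    """Set of known entity terms mentioned in the text."""
    return {t for t in tokens(text) if t in ENTITY_TERMS}


def hallucination_rate(report, evidence):
    """Fraction of report entities that appear in no evidence sentence."""
    found = entities(report)
    if not found:
        return 0.0
    supported = set().union(*(entities(s) for _, s in evidence))
    return len(found - supported) / len(found)


def rerank(candidates, evidence):
    """Pick the candidate report with the lowest unsupported-entity rate."""
    return min(candidates, key=lambda r: hallucination_rate(r, evidence))


# Toy data (fabricated for illustration only, not taken from Open-i).
corpus = [
    "The heart is enlarged consistent with cardiomegaly.",
    "Small left pleural effusion is present.",
    "No acute bony abnormality.",
]
evidence = retrieve_top_k("cardiomegaly with pleural effusion", corpus, k=2)
candidates = [
    "Cardiomegaly with pneumothorax [E2].",
    "Cardiomegaly [E2] with small effusion [E1].",
]
best = rerank(candidates, evidence)
print(evidence)
print(best)
```

In this toy setup the first candidate mentions an entity that no retrieved sentence supports, so the reranker prefers the second candidate. Scoring every candidate against the evidence also hints at why reranking adds inference time.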

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 610–623. https://doi.org/10.1145/3442188.3445922

Boag, W., Wittenberg, E., Folkman, L., Khosla, S., & Manrai, A. (2020). Chexbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Chen, Z., Song, Y., Chang, T.-H., & Wan, X. (2020). Generating radiology reports via memory-driven transformer. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1439–1449. https://doi.org/10.18653/v1/2020.emnlp-main.112

Chen, Z., Song, Y., Chang, T.-H., & Wan, X. (2021). Cross-modal memory networks for radiology report generation. Pattern Recognition, 118, 108050. https://doi.org/10.1016/j.patcog.2021.108050

Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani, S., Thoma, G. R., & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2), 304–310. https://doi.org/10.1093/jamia/ocv080

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://doi.org/10.1109/CVPR.2016.90

Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilona, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al. (2019). Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1), 590–597. https://doi.org/10.1609/aaai.v33i01.3301590

Izacard, G., & Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Jain, S., Delbrouck, J.-B., Vakil, P., Chi, P., Ommer, B., & Langlotz, C. (2021). Radgraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463.

Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730

Jing, B., Xie, P., & Xing, E. (2018). On the automatic generation of medical imaging reports. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2577–2586. https://doi.org/10.18653/v1/P18-1240

Johnson, A. E. W., Pollard, T. J., Berkowitz, S. J., Greenbaum, N. R., Lungren, M. P., Deng, C.-y., Mark, R. G., & Horng, S. (2019). Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6, 317. https://doi.org/10.1038/s41597-019-0322-0

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Stoyanov, V., & Riedel, S. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474.

Li, Y., Liang, X., Hu, Z., & Xing, E. (2018). TieNet: Text-image embedding network for common thorax disease classification and reporting in chest x-rays. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 9049–9058. https://doi.org/10.1109/CVPR.2018.00943

Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.

Liu, F., You, C., Wu, X., Xu, G., Liu, Y., Liu, T., & Wang, J. (2021). Multi-modal transformer for radiology report generation. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Nguyen, H., et al. (2022). BioViL: Self-supervised vision–language pretraining for biomedical imaging. arXiv preprint arXiv:2204.09817.

Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 311–318.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), 8748–8763.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4566–4575. https://doi.org/10.1109/CVPR.2015.7299087

Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., & Summers, R. M. (2017). Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2097–2106.

Published

2023-12-31

How to Cite

Danang Danang, & Toni Wijanarko Adi Putra. (2023). Evidence-Grounded Chest X-ray Report Generation with Retrieval, Citation, and Hallucination Control. Jurnal Sains Dan Kesehatan, 7(2), 53–74. https://doi.org/10.57214/jusika.v7i2.1127