Multi-Scale Vision-Language Foundation Model for Explainable Lung Cancer Risk Assessment from CT Imaging and Nodule Segmentation

Richard Bage; Pierre Webb

Authors

Richard Bage, Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA.
Pierre Webb, Department of Computer Science, Colorado State University, Fort Collins, CO, USA.

Keywords:

foundation model, vision-language model, lung cancer screening, nodule segmentation, explainable artificial intelligence, clinical deployment

Abstract

Lung cancer remains a leading cause of cancer mortality worldwide, and low-dose computed tomography screening has demonstrated mortality reduction through early detection. However, current computer-aided diagnosis systems often operate as opaque classifiers that provide inadequate explanations for clinical decision-making. This paper presents a multi-scale vision-language foundation model that integrates CT imaging with natural language radiology reports to deliver explainable risk assessments through structured textual justifications. The architecture combines a hierarchical vision encoder that captures nodule morphology across multiple spatial resolutions with a cross-modal alignment module that maps visual features to a domain-adapted language space. We discuss system-level design choices including the trade-offs between fine-grained segmentation accuracy and global context preservation, the challenges of aligning radiological semantics across institutions, and the infrastructure required for clinical deployment. The model employs a multi-stage training pipeline that leverages both supervised nodule segmentation and weakly supervised vision-language pre-training on large-scale chest CT-report pairs. Explainability is achieved through attention-based visual grounding and generated textual descriptions that highlight salient imaging findings, nodule characteristics, and risk-relevant features. We analyze governance and fairness implications arising from training data biases, demographic shifts, and the regulatory frameworks governing AI-assisted radiology. Robustness to scanner variability, population heterogeneity, and adversarial perturbations is examined alongside sustainability considerations for computational efficiency. 
We argue that vision-language foundation models can transform lung cancer screening programs by providing interpretable, evidence-based risk communication, but only if their design is accompanied by rigorous validation protocols and continuous monitoring in real-world clinical workflows.
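
The abstract does not state which objective the cross-modal alignment module optimizes. As a minimal illustration only, the sketch below assumes a symmetric contrastive loss over paired CT and report embeddings, in the spirit of contrastive language-image pre-training. The function names, the temperature value, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed alignment objective: pair i of image_embs and text_embs is the
    positive match, and every other pairing in the batch is a negative.
    The temperature of 0.07 is an illustrative default, not a reported value.
    """
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("expected equally sized, non-empty batches")
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every CT embedding and every report embedding.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on its own report.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak on its own scan.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy batch: two scans and two reports embedded in a 2-D space.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Under this assumed objective, correctly paired scans and reports yield a lower loss than swapped pairs, which is the property an alignment module of this kind would rely on during weakly supervised pre-training.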
