Causal Inference-Guided Adversarial Detection and Mitigation in Medical Large Language Model Agent Architectures

Authors

  • Aarav Shetty, School of Information Technology, University of Cincinnati, Cincinnati, OH, USA.
  • Petri L. Dawson, Department of Computer Science, George Mason University, Fairfax, VA, USA.
  • Zizhan Jiang, Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA.

Keywords:

causal inference, adversarial robustness, large language models, medical AI agents, system architecture, healthcare safety, counterfactual reasoning

Abstract

The integration of large language models (LLMs) into autonomous agent architectures for clinical decision support, diagnostic reasoning, and patient-facing triage introduces unprecedented capabilities and, with them, new vulnerabilities. Medical LLM agents, characterized by multi-step tool use, retrieval-augmented generation, and interactive reasoning loops, expand the attack surface for adversarial manipulation beyond single-turn prompt injection to causal perturbation of decision pathways. This paper presents a system-level framework that uses causal inference to model, detect, and mitigate adversarial threats in medical LLM agent architectures. We argue that purely correlation-based robustness methods are insufficient for safety-critical clinical settings because they fail to distinguish spurious associations from genuine causal mechanisms exploited by sophisticated adversaries. The proposed approach embeds causal structure learning and counterfactual reasoning within the agent’s execution pipeline, enabling real-time identification of anomalous intervention patterns that deviate from clinically plausible causal graphs. We examine structural trade-offs between detection latency and clinical workflow integration, discuss infrastructure requirements for deploying causal inference modules atop existing LLM stacks, and analyze fairness implications across demographic subgroups when adversarial examples exploit historical disparities encoded in training data. The paper further addresses governance challenges, including liability attribution across agent components, continuous monitoring under distribution shift, and the sustainability of the computational overhead of causal inference in resource-constrained healthcare settings. Throughout, we maintain a systems perspective, emphasizing how architectural decisions about causal module integration influence robustness, interpretability, and regulatory alignment. We conclude by outlining a research agenda for causally aware medical agent design that balances real-time performance demands with the epistemic rigor required for high-stakes clinical environments.

Published

2026-06-11

How to Cite

Aarav Shetty, Petri L. Dawson, & Zizhan Jiang. (2026). Causal Inference-Guided Adversarial Detection and Mitigation in Medical Large Language Model Agent Architectures. International Journal of Clinical and Translational Medicine, 1(1). Retrieved from https://ijctmed.org/index.php/home/article/view/162