Investigating the Clinical Utility of Large Language Models in Automated Electronic Health Record Summarization and Diagnostic Workflow Assistance

Jason Wainwright; Milliam Bolan; Alan Fitzgerald

Authors

Jason Wainwright, Department of Biomedical Informatics, University of Utah
Milliam Bolan, School of Computing and Information, University of Pittsburgh
Alan Fitzgerald, Department of Health Services Administration, University of Alabama at Birmingham

Keywords:

Electronic Health Records, Large Language Models, Clinical Workflow Optimization, Biomedical Informatics, Socio-Technical Systems, Algorithmic Governance

Abstract

The exponential growth of unstructured clinical data within electronic health records has imposed significant cognitive burdens on healthcare practitioners and exacerbated clinician burnout. Large language models offer a way to mitigate these administrative challenges by automating document synthesis and providing contextual diagnostic workflow assistance. This study investigates the clinical utility, systemic architecture, and operational trade-offs of deploying large language models within modern institutional health infrastructures. By evaluating the structural integration of transformer-based architectures with legacy clinical data systems, this paper examines how automated summarization affects clinical decision-making efficiency, diagnostic accuracy, and cognitive workload. The analysis addresses critical system-level vulnerabilities, including hallucination, data privacy constraints under federal regulations, computational sustainability, and the socio-technical dynamics of human-AI collaboration in high-stakes medical environments. Through an exploration of retrieval-augmented generation and localized model orchestration, we demonstrate how targeted architectural interventions can preserve semantic fidelity and minimize clinical risk. The investigation also outlines the governance frameworks, validation protocols, and algorithmic fairness metrics necessary to ensure equitable patient outcomes across diverse demographic cohorts. Overall, this research provides a blueprint for systemic deployment, illustrating that while large language models have considerable potential to optimize diagnostic workflows, their successful translation into clinical environments depends on balancing computational agility with robust algorithmic oversight and socio-technical alignment.
