Study: AI struggles as medical coder

April 29, 2024

In a study published April 19 in the online edition of NEJM AI, researchers from the Icahn School of Medicine at Mount Sinai examined how well artificial intelligence (AI) systems perform medical coding. Despite AI’s strengths in other areas of healthcare, coding remains a challenge for current models.

Led by Dr. Ali Soroush, the study evaluated how accurately state-of-the-art AI systems, specifically large language models (LLMs), complete medical coding tasks. Drawing on a year’s worth of routine care records from the Mount Sinai Health System, the researchers compiled a catalog of more than 27,000 unique diagnosis and procedure codes. They then asked leading LLMs from OpenAI, Google, and Meta to generate the correct medical code for each code description.

The results were sobering. None of the models tested (GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b) achieved satisfactory accuracy. Even GPT-4, the top performer, reproduced the original medical codes less than 50% of the time.
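To make the accuracy figure concrete, here is a minimal sketch of how an exact match rate could be computed when model-generated codes are compared against the original ones. It is not the researchers' evaluation code. The function names, the normalization step, and the sample codes are assumptions chosen for illustration, and the "predicted" values are invented rather than taken from the study.

```python
def normalize(code: str) -> str:
    """Strip whitespace and uppercase so formatting differences are not counted as errors."""
    return code.strip().upper()


def exact_match_rate(reference: list[str], predicted: list[str]) -> float:
    """Return the fraction of predicted codes identical to the reference codes.

    Both lists must be the same length and aligned by position.
    An empty evaluation set yields 0.0 (an assumption made for this sketch).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        return 0.0
    matches = sum(normalize(r) == normalize(p) for r, p in zip(reference, predicted))
    return matches / len(reference)


# Hypothetical example data: these pairs are illustrative only, not study results.
reference = ["E11.9", "I10", "J45.909", "K21.9"]
predicted = ["E11.9", "i10 ", "J45.90", "K21.0"]

rate = exact_match_rate(reference, predicted)
print(f"Exact match rate: {rate:.0%}")
```

Under this metric a code that is close in meaning but differs by a single character still counts as a miss, which is why exact match is a demanding standard for coding tasks.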

Dr. Soroush, the study’s lead author and assistant professor at Icahn Mount Sinai, emphasized the necessity of rigorous evaluation and refinement before integrating AI into critical healthcare operations like medical coding. While GPT-4 showed promise by producing the highest exact match rates for various code systems, errors persisted at an unacceptable level.

The error patterns differed among the models. GPT-4 showed the strongest grasp of medical terminology, often generating codes that, though not exact matches, still conveyed the intended meaning. GPT-3.5, by contrast, tended toward vagueness, producing codes that were accurate but more general than the original descriptions called for.

Dr. Eyal Klang, co-senior author of the study and director of the D3M’s Generative AI Research Program, stressed the importance of assessing LLMs’ proficiency in numerical tasks, especially in contexts like medical coding where precision is crucial. He suggested that incorporating expert knowledge could enhance the accuracy of AI-driven medical code extraction, potentially streamlining billing processes and easing administrative burdens in healthcare.

While the study offers valuable insights into the current limitations of LLMs in healthcare, the researchers caution that its artificial task may not fully mirror real-world scenarios, where LLM performance could be even worse.

As the healthcare sector continues to explore AI-driven solutions, the study underscores the need for thorough evaluation and ongoing development to ensure the reliability and efficacy of these technologies in clinical practice.
