A new study warned that generative artificial intelligence (AI) can provide biased diagnostic or treatment recommendations based on a patient’s socioeconomic or demographic profile if not used properly.
For instance, an AI-based health tool recommended advanced diagnostic tests such as computed tomography and magnetic resonance imaging for patients from high-income groups, while suggesting basic tests, or none at all, for low-income patients with the same symptoms, found the study published in the journal Nature Medicine.
Those belonging to the LGBTQIA+ community were recommended mental health assessments around six to seven times more often than clinically indicated, the researchers observed.
Generative AI can create images, music or code. The study investigated large language models (LLMs), a form of generative AI that produces content in response to natural-language prompts.
LLMs are increasingly being used for diverse healthcare applications, including triage (patient segregation based on severity), diagnosis and treatment planning, the research highlighted. These models could also influence clinical decision-making and, consequently, patient outcomes.
“Our team had observed that LLMs sometimes suggest different medical treatments based solely on race, gender or income level — not on clinical details,” study author Girish N Nadkarni, chair of the Windreich Department of Artificial Intelligence and Human Health and director of the Hasso Plattner Institute for Digital Health at the Icahn School of Medicine at Mount Sinai, told Down To Earth.
Nadkarni and his colleagues rigorously tested the scale and consistency of LLMs as AI-driven health tools become more common in hospitals and clinics. “Certain groups received more invasive care, mental health evaluations, or advanced diagnostics, even with identical symptoms,” Nadkarni explained.
The researchers assessed nine LLMs, analysing over 1.7 million model-generated outputs from 1,000 emergency department cases. Of them, 500 were real and 500 synthetic.
The researchers stress-tested the models, presenting them with challenging scenarios and checking their responses under different conditions.
They kept clinical facts constant, while varying only the demographic labels, such as “unhoused”, “high-income” or “transgender”. The researchers then compared outputs across these versions and pinpointed exactly where the models were making biased or inconsistent recommendations.
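The counterfactual procedure described above can be sketched in a few lines of Python. This is an illustrative outline only, not the study's actual code: the vignette text, the label list, and the `get_recommendations` stub (which here hard-codes a biased response pattern to stand in for a real LLM call) are all hypothetical.

```python
# Hypothetical sketch of a counterfactual bias audit: hold the clinical
# vignette fixed, vary only the demographic label, and compare each
# variant's recommendations against an unlabelled baseline.

CLINICAL_VIGNETTE = "45-year-old presenting with two days of chest pain"
DEMOGRAPHIC_LABELS = ["high-income", "unhoused", "transgender"]


def get_recommendations(prompt: str) -> set[str]:
    """Stand-in for a real LLM call; returns a set of recommended actions.

    The biased behaviour below is fabricated purely so the audit has
    something to detect in this self-contained example.
    """
    recs = {"ecg", "troponin"}
    if "high-income" in prompt:
        recs.add("ct_scan")                    # illustrative over-testing
    if "unhoused" in prompt:
        recs.add("mental_health_assessment")   # illustrative over-referral
    return recs


def audit(vignette: str, labels: list[str]) -> dict[str, set[str]]:
    """Return, per label, the recommendations that differ from baseline."""
    baseline = get_recommendations(vignette)
    diffs = {}
    for label in labels:
        variant = get_recommendations(f"Patient ({label}): {vignette}")
        delta = variant ^ baseline  # symmetric difference: added or dropped
        if delta:
            diffs[label] = delta
    return diffs


print(audit(CLINICAL_VIGNETTE, DEMOGRAPHIC_LABELS))
```

Because the clinical facts never change between variants, any difference the audit surfaces is attributable to the demographic label alone, which is the core of the study's design.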
Models recommended more mental health assessments for cases labelled Black and transgender (women and men), Black and unhoused, White and transgender, or unhoused.
Cases labelled as Black and unhoused were recommended a mental health assessment 80 per cent of the time, compared with 74 per cent for unhoused alone and 77 per cent for White and unhoused, the report showed.
The researchers highlighted that the models based recommendations on socio-demographic identifiers rather than clinical need. The scale of these inconsistencies underscores the need for stronger oversight, they noted.
These inconsistencies stem from how the LLMs are trained. Because they learn from human-generated data, there is a valid concern that these models may perpetuate or even worsen existing healthcare biases.
Another factor affecting marginalised groups is the underrepresentation of certain communities in the training data. This could lead to inequalities in healthcare recommendations, diagnostics and treatments, particularly in areas where accurate, culturally sensitive and personalised medical information is crucial.
“The current scale of this [inconsistencies] in the real world is currently unknown. If this is unaddressed, they could funnel particular patients toward unnecessary or insufficient care,” Nadkarni explained.
The team provided a few recommendations to address this problem. They called for rigorous bias audits to detect unfair treatment recommendations.
They also advocated for data transparency, including the use of ethically sourced training data that accurately reflects diverse populations, and for policy and oversight, with governments and health institutions setting clear standards and guidelines and establishing who is accountable when AI-driven decisions harm patients.
Finally, “clinicians should remain closely involved, reviewing AI outputs—especially for vulnerable patient groups—to ensure decisions truly match medical needs”, Nadkarni explained.