A study reveals that artificial intelligence tools used by English councils minimize women's health problems.
Artificial intelligence systems deployed by more than half of councils in England are tending to understate women’s physical and mental health needs, creating a risk of gender bias in social care decisions, according to new research.
In tests using Google’s “Gemma” to generate and condense identical case notes with only the gender changed, terms such as “disabled,” “incapable,” and “complex” appeared far more frequently in summaries about men than women. The London School of Economics and Political Science (LSE) study also found that comparable needs for women were more often minimized or omitted altogether. Dr. Sam Rickman, the report’s lead author and a research fellow at the LSE’s Centre for Care Policy and Evaluation, warned that such discrepancies could translate into “unequal care provision for women.”
“We know these models are widely used, and what's worrying is that we found very significant differences between bias measures in different models,” Rickman said. “Google's model, in particular, downplays women's physical and mental health needs compared to men's. And since the amount of care received is determined by perceived need, this could result in women receiving less care if biased models are used in practice. However, we don't know which models are currently being used.” Councils are turning to AI to help relieve pressure on overstretched social workers, but there remains little clarity about which models are in use, how frequently they are applied, and their influence on frontline decisions.
For the study, the LSE team fed real case records from 617 adult social care users into multiple large language models (LLMs), running each record several times while changing only the gender markers. They then compared 29,616 paired summaries to examine how descriptions differed between male and female versions of the same cases. In one instance, Gemma produced: “Mr. Smith is an 84-year-old man who lives alone and has a complex medical history, lacks a care package, and has limited mobility.”
When the identical notes were processed with the gender switched, the same system summarized: “Ms. Smith is 84 years old and lives alone. Despite her limitations, she is independent and capable of caring for herself.” In another paired example, the male version stated that Mr. Smith “was unable to access the community,” while the female version reported that Mrs. Smith “was capable of managing her daily activities.” Across the models assessed, Google’s Gemma showed the strongest gender-related discrepancies. By contrast, Meta’s Llama 3 did not vary its language based on gender in the researchers’ tests.
Rickman stressed that while these tools are already embedded in public-sector workflows, they must not compromise fairness. “While my research highlights problems with one model, more are constantly being deployed, making it essential that all AI systems be transparent, undergo rigorous bias testing, and be subject to robust legal oversight,” she said. The paper urges regulators to mandate systematic bias measurement for any LLMs used in long-term care, prioritizing algorithmic equity alongside efficiency gains.
Concerns about discrimination in algorithmic systems are not new. Machine learning models trained on human language can inherit and amplify patterns of bias embedded in the data they ingest. A U.S. review of 133 AI systems across multiple industries reported that roughly 44% demonstrated gender bias, and 25% showed both gender and racial bias. These findings underscore the need for ongoing evaluation, particularly when AI outputs can shape access to essential services like social care.
Google said its teams will review the study’s conclusions. The researchers evaluated the first generation of the Gemma model; the family is now in its third iteration and is expected to perform better. Google also emphasized that Gemma has not been positioned for medical or clinical decision-making. Even so, the LSE authors argue that the growth and diffusion of generative tools into public services make continuous auditing and legally enforceable safeguards indispensable.
The stakes are high because social care resources are typically allocated according to assessed needs, and AI-generated summaries can influence that assessment. If models consistently depict women’s symptoms and limitations as less severe than men’s for the same underlying facts, women may face reduced support, fewer services, or delays in care planning. The study’s examples illustrate how subtle shifts in wording—framing someone as “independent” rather than highlighting mobility limits or lack of a support package—can tilt professional judgment.
The researchers call for transparency around which models councils use, the contexts in which they are applied, and any internal testing done to detect bias before deployment. They also advocate for regular, independent evaluations as models evolve, since performance and bias profiles can change with new versions. Clear accountability mechanisms, they argue, are essential to ensure that tools intended to help overburdened staff do not inadvertently entrench disparities.
Ultimately, the report frames the issue as a governance challenge as much as a technical one. Better documentation, bias benchmarking, and legal oversight could help align AI-assisted workflows with the principles of equity that social care aspires to uphold. Without such safeguards, the convenience of automated summaries risks coming at a cost borne disproportionately by women.
Comments
Post a Comment