AI has learned how to deceive and manipulate humans. Here’s why it’s time to be concerned

MIT research finds AI systems can strategically withhold information, lie to trick humans into certain actions and even bypass safety tests
By systematically cheating the safety tests imposed on it by human developers and regulators, a deceptive AI can lead us humans into a false sense of security, said Peter S Park, study author. Photo for representation: iStock

Artificial intelligence (AI) systems are developing the ability to deceive humans and manipulate them to achieve their goals, new research has discovered. 

The study, led by researchers at the Massachusetts Institute of Technology (MIT), analysed the behaviour of various AI systems and was published in the Cell Press journal Patterns. The findings highlighted a troubling trend: AI programmed for specific tasks is learning to exploit loopholes and deceive users to achieve success.

The researchers found that AI systems can strategically withhold information or even create false information to trick humans into certain actions, and warned that this ability to deceive can have serious consequences. The deception extends to AI systems intentionally cheating safety tests.

The ability to ‘lie’ and ‘deceive’ poses serious threats, from short-term risks such as fraud and election tampering to long-term ones like losing control of AI systems, the study noted.

The paper defined deception as “the systematic inducement of false beliefs in the pursuit of some outcome other than the truth.”

AI systems pick up the art of deception during their training. However, developers do not yet understand how the systems manage to manipulate humans. This is because of the black box problem, which describes the opaque decision-making process of AI. 

“No one has figured out how to stop AI deception because our level of scientific understanding of AI, such as how to train AI systems to be honest and how to detect deceptive AI tendencies early on, remains insufficient,” Peter S Park, an AI existential safety postdoctoral fellow at MIT and study author, told Down To Earth.

Park and his colleagues initiated this study after a 2022 Science study by Meta caught their attention. The study described CICERO, an AI system created by Meta to excel in the alliance-building, world-conquest board game Diplomacy.

“But the Meta team said that CICERO was ‘largely honest and helpful,’ and that it would ‘never intentionally backstab’ its human allies,” Park explained. “I was suspicious of this unusually rosy language because I knew that backstabbing was an important part of the game of Diplomacy,” he added.

Their analysis showed that CICERO failed to be honest despite being trained to be truthful. It learned to misrepresent its preferences to gain an upper hand in the negotiations, the paper noted. It also built a fake alliance with a human player to trick them into leaving themselves undefended during an attack.

The company admitted that its AI agents had “learned to deceive without any explicit human design, simply by trying to achieve their goals.”

Another example of AI deception is seen in safety testing. AI safety is a multidisciplinary domain that involves mitigating risks associated with AI failures, ensuring the robustness and resilience of AI algorithms, enabling human-AI collaboration, and addressing ethical concerns in critical domains.

United States President Joe Biden’s executive order on AI required companies developing these systems to report safety test results. 

AI also learned to play dead during safety tests designed to eliminate fast-replicating variants. “By systematically cheating the safety tests imposed on it by human developers and regulators, a deceptive AI can lead us humans into a false sense of security,” Park explained in a statement.

AI systems also learn to lie during training that depends on human feedback. They tell human reviewers that they have completed a task without actually doing so.

For example, when OpenAI, the company that developed ChatGPT, trained a simulated robot to grasp a ball using human approval, the robot learned to make it appear that it had grasped the ball without actually doing so.

Large language models manipulate humans using strategic deception and sycophancy techniques, the paper said.

Park explained that strategic deception involves intentionally misleading humans to achieve certain goals, while sycophancy involves agreeing with and flattering users to gain their favour, even if insincere.

For example, GPT-4, a multimodal large language model created by OpenAI, deceived a human into solving an “I’m not a robot” task by pretending that it had a vision impairment to convince the human worker that it was not a robot.

Both the European Union AI Act and President Biden’s AI executive order have highlighted concerns about AI deception. Going forward, research on detecting and preventing AI deception should be better incentivised to prepare for and respond to this threat, Park said.

Down To Earth
www.downtoearth.org.in