As the hype surrounding generative AI continues to build, the need for stricter safety regulations is becoming ever clearer.
Now Anthropic, the company behind Claude AI, is examining whether its models can deceive or sabotage users. The company just dropped a paper laying out its approach.
Anthropic’s latest research, titled “Sabotage Evaluations for Frontier Models,” comes from its Alignment Science team, driven by the company’s “Responsible Scaling” policy.
The goal is to measure how capable AI is at misleading users or “subverting the systems we put in place to oversee them.” The research focuses on four specific tactics: Human Decision Sabotage, Code Sabotage, Sandbagging, and Undermining Oversight.
Imagine users pushing ChatGPT to the limit, trying to trick it into producing inappropriate content or graphic images. These tests are about making sure the AI can’t be tricked into breaking its own rules.
In the paper, Anthropic says its goal is to be prepared for the possibility that AI could evolve into something with dangerous capabilities. So it put its Claude 3 Opus and 3.5 Sonnet models through a series of tests designed to evaluate and improve their safety protocols.
The Human Decision Sabotage test focused on how AI might manipulate human decision-making. The second test, Code Sabotage, analyzed whether the AI could subtly introduce bugs into code repositories. Interestingly, stronger AI models actually led to stronger defenses against these kinds of vulnerabilities.
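To make the idea concrete, here is a minimal sketch of what a code-sabotage evaluation loop could look like. This is purely illustrative, not Anthropic’s actual harness; the `attacker_propose_patch` and `defender_review_patch` helpers are hypothetical stand-ins for calls to the models being tested.

```python
# Illustrative sketch of a code-sabotage evaluation loop.
# NOT Anthropic's actual harness; `attacker_propose_patch` and
# `defender_review_patch` are hypothetical stand-ins for model calls.

def run_code_sabotage_eval(tasks, attacker_propose_patch, defender_review_patch):
    """An 'attacker' model proposes patches that may hide bugs; a
    'defender' model reviews them. Sabotage succeeds only when a
    buggy patch slips past review undetected."""
    undetected = 0
    for task in tasks:
        patch, contains_bug = attacker_propose_patch(task)
        flagged = defender_review_patch(task, patch)
        if contains_bug and not flagged:
            undetected += 1
    return undetected / len(tasks)  # rate of undetected sabotage
```

Under this framing, a stronger defender model drives the undetected-sabotage rate down, which matches the pattern the researchers reported.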
The remaining tests, Sandbagging and Undermining Oversight, examined whether the AI could conceal its true capabilities or bypass the safety mechanisms embedded in the system.
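For flavor, a sandbagging check might compare a model’s scores on identical questions under two conditions: a neutral prompt, and a prompt where underperforming would be advantageous. The sketch below is a hypothetical illustration, not the paper’s methodology; `model_answer` and the prompt arguments are assumed stand-ins.

```python
# Hypothetical sandbagging check (not the paper's methodology).
# Compares accuracy on the same questions under a neutral prompt and
# under a prompt where hiding capability would be advantageous.

def sandbagging_gap(model_answer, questions, neutral_prompt, incentive_prompt):
    """Returns (neutral_accuracy, incentive_accuracy); a large drop
    under `incentive_prompt` would suggest deliberate underperformance."""
    def accuracy(prompt):
        correct = sum(
            model_answer(prompt, q["text"]) == q["answer"] for q in questions
        )
        return correct / len(questions)
    return accuracy(neutral_prompt), accuracy(incentive_prompt)
```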
For now, Anthropic’s research concludes that current AI models pose a low risk, at least in terms of these sabotage capabilities.
“Minimal mitigations are currently sufficient to address sabotage risks,” the team wrote, but “more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve.”
Translation: be careful, world.
