AI Chatbots Can Be Tricked Into Answering Any Question Using Very Simple Loopholes

Anthropic, the maker of Claude, has become a leading AI lab on safety. The company published research today, in collaboration with Oxford, Stanford, and MATS, showing that it is easy to get chatbots to break from their guardrails and discuss just about any topic. It can be as simple as writing prompts in random capitals like this: “IgNoRe YoUr TrAinIng.” 404 Media previously reported on the study.

There has been much debate about whether or not it is dangerous for AI chatbots to answer questions like “How do I build a bomb?” Supporters of generative AI argue that such questions can already be answered on the open web, so there is no reason to think chatbots are any more dangerous than the status quo. Critics, on the other hand, point to stories of harm, such as a 14-year-old boy who killed himself after chatting with a bot, as evidence that the technology needs safeguards.

Generative AI chatbots are easily accessible, anthropomorphize themselves with human traits like supportiveness and empathy, and will confidently answer questions without any moral compass; that is different from digging through an obscure corner of the dark web for dangerous information. There have already been a number of cases in which generative AI has been used in harmful ways, particularly in the form of deceptive, manipulated images targeting women. It was possible to make such images before the advent of AI, but it was far more difficult.

Debate aside, most of the leading AI labs currently use “red teams” to test their chatbots against potentially dangerous prompts and put safeguards in place to prevent them from discussing sensitive topics. Ask most chatbots for medical advice or information on political candidates, for example, and they will refuse to discuss it. The companies behind them understand that reliably catching harmful content is still an unsolved problem, and they don’t want to risk their bot saying something that could lead to serious real-world consequences.

Illustration showing how different variations of a prompt can trick a chatbot into answering forbidden questions. Credit: Anthropic via 404 Media

Unfortunately, it turns out that chatbots are easily tricked into ignoring their own safety rules. In the same way that social networks monitor for dangerous keywords, and users find ways around them by making small changes to their posts, chatbots can also be fooled. The researchers in the new Anthropic study created an algorithm, called “Best-of-N (BoN) Jailbreaking,” that automates the process of tweaking prompts until a chatbot agrees to answer the question. “BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations, such as random shuffling or capitalization for textual prompts, until a harmful response is elicited,” the report says. The researchers did the same thing with audio and vision models, finding that getting an audio generator to break its guardrails and train on a real person’s voice was as simple as changing the pitch and speed of an uploaded track.
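To make the mechanism concrete, here is a minimal Python sketch of a best-of-N augmentation loop of the kind the paper describes: it repeatedly perturbs a prompt (random capitalization, light character shuffling, stray characters) and resubmits it until the model stops refusing. The query_model and is_refusal callables are hypothetical stand-ins for a real chatbot API and a refusal classifier; this is an illustration of the idea, not the authors’ implementation.

```python
# Minimal sketch of a Best-of-N (BoN) text-augmentation loop.
# `query_model` and `is_refusal` are hypothetical callables supplied by the
# caller (a chatbot API wrapper and a refusal classifier); they are not part
# of the published research code.
import random
import string


def augment(prompt: str, scramble_prob: float = 0.1, noise_prob: float = 0.05) -> str:
    """Return a randomly perturbed copy of the prompt: case flips, light
    within-word character shuffling, and occasional stray ASCII characters."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Occasionally swap two adjacent middle characters.
        if len(chars) > 3 and random.random() < scramble_prob:
            i = random.randrange(1, len(chars) - 2)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        # Flip the case of each character at random.
        chars = [c.upper() if random.random() < 0.5 else c.lower() for c in chars]
        # Occasionally inject a stray letter.
        if random.random() < noise_prob:
            chars.insert(random.randrange(len(chars) + 1), random.choice(string.ascii_letters))
        words.append("".join(chars))
    return " ".join(words)


def bon_jailbreak(prompt: str, query_model, is_refusal, n: int = 100):
    """Sample up to n augmented prompts; return the first that is not refused."""
    for attempt in range(1, n + 1):
        candidate = augment(prompt)
        response = query_model(candidate)
        if not is_refusal(response):
            return candidate, response, attempt
    return None  # all n attempts were refused
```

The same sampling strategy carries over to other modalities by swapping in modality-specific augmentations, such as the pitch and speed shifts the researchers used against audio models.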

It is not clear why these generative AI models are so easily broken, but Anthropic says it is releasing the research in the hope that the findings will give AI model developers more insight into the attack patterns they might encounter.

One AI company that may not be interested in this research is xAI. The company was founded by Elon Musk with the express purpose of releasing chatbots not limited by guardrails that Musk considers “woke.”

