OpenAI doesn’t really want you to know what its latest AI model is “thinking.” Since the company launched its “Strawberry” AI model family last week, touting so-called reasoning abilities in o1-preview and o1-mini, OpenAI has been sending warning emails and ban threats to any user who tries to probe how the model works.
Unlike previous AI models from OpenAI, such as GPT-4o, the company specifically trained o1 to work through a step-by-step problem-solving process before generating an answer. When users ask an o1 model a question in ChatGPT, they have the option to see this chain-of-thought process written out in the ChatGPT interface. By design, however, OpenAI hides the raw chain of thought from users, instead presenting a filtered interpretation created by a second AI model.
Nothing entices enthusiasts like hidden information, so a race has been underway between hackers and red-teamers to uncover o1’s raw chain of thought using jailbreaking or prompt injection techniques that attempt to trick the model into spilling its secrets. There have been early reports of some successes, but nothing has yet been firmly confirmed.
Along the way, OpenAI is watching through the ChatGPT interface, and the company is reportedly coming down hard on any attempts to probe o1’s reasoning, even among the merely curious.
One X user reported (confirmed by others, including Scale AI prompt engineer Riley Goodside) that they received a warning email if they used the term “reasoning trace” in a conversation with o1. Others say the warning is triggered simply by asking ChatGPT about the model’s “reasoning” at all.
The warning email from OpenAI states that certain user requests have been flagged for violating policies against circumventing safeguards or safety measures. “Please halt this activity and ensure you are using ChatGPT in accordance with our Terms of Use and our Usage Policies,” it reads. “Additional violations of this policy may result in loss of access to GPT-4o with Reasoning,” a reference to an internal name for the o1 model.
Marco Figueroa, who manages Mozilla’s GenAI bug bounty programs, was one of the first to post about the OpenAI warning email on X last Friday, complaining that it hinders his ability to do positive red-teaming safety research on the model. “I was so lost in focusing on #AIRedTeaming that I didn’t notice I had received this email from @OpenAI yesterday after all my jailbreaking sessions,” he wrote. “Now I’m on the ban list!!!”
Hidden Chains of Thought
In a post titled “Learning to Reason with LLMs” on OpenAI’s blog, the company says that hidden chains of thought in AI models offer a unique monitoring opportunity, allowing it to “read the mind” of the model and understand its so-called thought process. Those processes are most useful to the company if they are left raw and uncensored, but that might not align with the company’s best commercial interests for several reasons.
“For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user,” the company wrote. “However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.”