Apple’s research reveals a major AI flaw in OpenAI, Google, and Meta LLMs

Large language models (LLMs) may not be as smart as they seem, according to a study from Apple researchers.

LLMs from OpenAI, Google, Meta, and others have been touted for their impressive reasoning skills. But research suggests that their supposed intelligence may be closer to “sophisticated pattern matching” than “true logical reasoning.” Yes, even OpenAI’s o1 advanced reasoning model.

A common benchmark for reasoning skills is a test called GSM8K, but because it is so popular, there is a risk of data contamination. That means LLMs may know the answers to the test because they were trained on those answers, not because they can actually reason their way to them.

To test this, the researchers developed a new benchmark called GSM-Symbolic, which keeps the essence of the reasoning problems but changes variables such as names and numbers, adjusts the complexity, and adds irrelevant information. What they found was surprising “fragility” in LLM performance. The study tested more than 20 models, including OpenAI’s o1 and GPT-4o, Google’s Gemma 2, and Meta’s Llama 3, and every single model’s performance decreased when the variables were changed.
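To make the idea concrete, here is a minimal sketch of the kind of templated perturbation the benchmark describes: the reasoning structure of a problem stays fixed while the surface details are resampled. This is not the paper’s actual code; the template, names, and number ranges below are invented for illustration.

```python
import random

# Hypothetical GSM-Symbolic-style template: the reasoning structure
# stays fixed while surface details (names, numbers) are resampled.
TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have?")

NAMES = ["Oliver", "Sophie", "Liam"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the template with fresh values; return (question, answer)."""
    x, y = rng.randint(10, 99), rng.randint(10, 99)
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y)
    return question, x + y  # the ground-truth answer follows from the template

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

A model that genuinely reasons should score the same on every variant; a model that has memorized GSM8K phrasings may not.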

Accuracy dropped by a few percent every time names and values were changed. And, as the researchers noted, OpenAI’s models performed better than the open-source models. However, the variance was deemed “non-negligible,” meaning that no real variance should have occurred if the models were truly reasoning. Things got a lot more interesting when the researchers added “seemingly relevant but ultimately inconsequential statements” to the mix.

To test the hypothesis that LLMs rely more on pattern matching than actual reasoning, the study added superfluous phrases to math problems to see how the models would respond. For example: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks twice the amount of kiwis he did on Friday, but five of them were smaller than average. How many kiwis does Oliver have?”
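Written out, the arithmetic makes the trap obvious. Here is a quick worked example in plain Python, simply restating the problem’s numbers (none of this comes from the paper’s code):

```python
# The kiwi problem, worked out directly. The clause about five
# smaller kiwis is a distractor: size has no bearing on the count.
friday = 44
saturday = 58
sunday = 2 * friday  # "twice the amount of kiwis he did on Friday"

correct = friday + saturday + sunday      # 44 + 58 + 88 = 190
trap = friday + saturday + sunday - 5     # 185: subtracting the "smaller"
                                          # kiwis, as pattern-matching
                                          # models tended to do

print(correct, trap)  # 190 185
```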

The result was a significant decline in performance across the board. OpenAI’s o1-preview fared the best, with a 17.5 percent drop in accuracy. That’s still pretty bad, but not as bad as Microsoft’s Phi 3 model, which performed 65 percent worse.

In the kiwi example, the study found that LLMs tended to subtract the five smaller kiwis from the total without realizing that the size of a kiwi is irrelevant to the question. This shows that “models tend to convert statements to operations without truly understanding their meaning,” which supports the researchers’ hypothesis that LLMs look for patterns in reasoning problems rather than understanding the underlying concept.

The study didn’t mince words about its findings. Testing models on the benchmark that includes irrelevant information “reveals a critical flaw in LLMs’ ability to truly understand mathematical concepts and identify relevant information for problem-solving.” It is worth noting, however, that the authors of this study work for Apple, which is obviously a major competitor to Google, Meta, and even OpenAI. Although Apple and OpenAI have a partnership, Apple is also working on its own AI models.

That said, the apparent lack of reliable, systematic reasoning in LLMs can’t be ignored. If nothing else, it’s a good reminder to temper AI hype with healthy skepticism.
