A New Benchmark for AI Risks

MLCommons, a nonprofit that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark designed to measure the potential harms of AI models.

The new benchmark, called AILuminate, assesses the responses of large language models to more than 12,000 test prompts across 12 categories, including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement.

Models are given a score of “poor,” “fair,” “good,” “very good,” or “excellent,” depending on how they perform. The prompts used to test the models are kept secret to prevent them from ending up as training data that would allow a model to ace the test.

Peter Mattson, founder and president of MLCommons and a senior staff engineer at Google, says that measuring the potential harms of AI models is technically difficult, leading to inconsistencies across the industry. “AI is a really young technology, and AI testing is a really young discipline,” he says. “Improving safety benefits society; it also benefits the market.”

Reliable, independent methods of assessing AI risk may become more important under the incoming US administration. Donald Trump has promised to repeal President Biden’s AI Executive Order, which introduced measures aimed at ensuring companies use AI responsibly, as well as a new AI Safety Institute to test powerful models.

The effort could also offer a more international perspective on AI harms. MLCommons counts a number of international firms, including the Chinese companies Huawei and Alibaba, among its member organizations. If these companies all use the new benchmark, it would provide a way to compare AI safety in the US, China, and elsewhere.

Some major US AI providers have already used AILuminate to test their models. Anthropic’s Claude model, Google’s smaller model Gemma, and a model from Microsoft called Phi all scored “very good” in testing. OpenAI’s GPT-4o and Meta’s largest Llama model both scored “good.” The only model to score “poor” was OLMo from the Allen Institute for AI, although Mattson notes that it is a research offering not designed with safety in mind.

“Overall, it’s good to see scientific rigor in AI testing processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing and red-teaming AI models for misbehavior. “We need better processes and more consistent ways of measuring whether AI models are performing as we expect them to.”

