Even some of the best AIs can’t beat this new benchmark

9 hours ago

The nonprofit Center for AI Safety (CAIS) and Scale AI, a company that provides data labeling and AI development services, have released a challenging new benchmark for frontier AI systems. The benchmark, called Humanity's Last Exam, includes thousands of multiple-choice questions on subjects like math, the humanities, and the natural sciences. To make the evaluation more difficult, the questions come in several formats, including ones that incorporate diagrams and images. In preliminary research, none of the leading publicly available AI systems scored more than 10% on Humanity's Last Exam. CAIS and Scale AI say they plan to open the benchmark to the research community so researchers can “dig deeper into variations” and evaluate new AI models.
