An organization that develops math benchmarks for AI didn't disclose that it had received funding from OpenAI until relatively recently, drawing allegations of impropriety from some in the AI community.

Epoch AI, a nonprofit primarily funded by Open Philanthropy, a research and grantmaking foundation, revealed on December 20 that OpenAI had supported the creation of FrontierMath. FrontierMath, a test with expert-level problems designed to measure an AI's math skills, was one of the benchmarks OpenAI used to demo o3, its upcoming flagship AI.

In a post on the forum LessWrong, a contractor for Epoch AI with the username "Meemi" said that many contributors to the FrontierMath benchmark weren't informed of OpenAI's involvement until it was made public. "The communication about this has been non-transparent," Meemi wrote. "In my opinion, Epoch AI should have disclosed the OpenAI funding, and contractors should have transparent information about the potential of their work being used for capabilities, when choosing whether to work on a benchmark."

On social media, some users raised concerns that the secrecy could damage FrontierMath's reputation as an objective benchmark. In addition to backing FrontierMath, OpenAI has visibility into many of the problems and solutions in the benchmark, a fact Epoch AI didn't reveal until December 20, when o3 was announced.

In a reply to Meemi's post, Tamay Besiroglu, associate director of Epoch AI and one of the organization's co-founders, insisted that the integrity of FrontierMath hadn't been compromised, but admitted that Epoch AI "made a mistake" in not being more transparent. "We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible," Besiroglu wrote. "Our mathematicians deserved to know who might have access to their work. Even though we were contractually limited in what we could say, we should have made transparency with contributors a non-negotiable part of our agreement with OpenAI."

Besiroglu added that while OpenAI has access to FrontierMath, it has a "verbal agreement" with Epoch AI not to use FrontierMath's problems to train its AI. (Training an AI on FrontierMath would be akin to teaching to the test.) Epoch AI also maintains a "separate holdout set" that serves as an additional safeguard for independently verifying FrontierMath benchmark results, Besiroglu said. "OpenAI has … fully supported our decision to maintain a separate set," Besiroglu wrote.

However, muddying the waters, Epoch AI lead mathematician Elliot Glazer noted in a Reddit post that Epoch AI hasn't been able to independently verify OpenAI's o3 FrontierMath results. "My opinion is that [OpenAI's] scores are valid (that is, they did not train on the dataset), and that they have no incentive to lie about their internal benchmarking performance," Glazer said. "However, we can't vouch for them until our independent evaluation is complete."

The saga is yet another example of the challenge of developing empirical benchmarks to evaluate AI, and of securing the resources needed for benchmark development without creating the perception of a conflict of interest.