Every Sunday, NPR host Will Shortz, the New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That's why some experts think they're a promising way to test the limits of AI's problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas, Northeastern University, and Charles University created an AI benchmark using riddles from Sunday Puzzle episodes. The team says the test surfaced surprising behavior, like reasoning models – OpenAI's o1, among others – sometimes "giving up" and offering answers they know to be wrong.

"We wanted to develop a benchmark with problems that humans can understand with only general knowledge," said Arjun Guha, a computer science faculty member at Northeastern and one of the study's co-authors.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that have little relevance to the average user. Meanwhile, many benchmarks – even benchmarks released relatively recently – are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased such that models can't draw on rote memory to solve them, Guha explained.

"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it – that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, models trained on them could "cheat" in a sense, though Guha says he hasn't seen evidence of this.

"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."

On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take longer to arrive at solutions – typically seconds to minutes longer.

At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state, verbatim, "I give up," followed by an incorrect answer chosen seemingly at random – behavior a human can certainly relate to.

The models make other bizarre choices too, like giving a wrong answer only to quickly retract it, attempt to tease out a better one, and fail again. They also get stuck "thinking" forever, give nonsensical explanations for answers, or arrive at a correct answer right away only to go on weighing alternative answers for no obvious reason.

"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model emulates what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

R1 getting "frustrated" on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini. (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where those models could be improved.

The team's scores on the benchmark. Image Credits: Guha et al.
"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are – and aren't – capable of."
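The study doesn't spell out the team's evaluation harness, but the basic shape of a quiz-style benchmark like this one is easy to sketch. Below is a minimal, hypothetical Python harness: it assumes each riddle is a (question, expected-answer) pair and treats the model as any prompt-to-string callable. The prompt wording, the normalization rule, and the `toy_model` stub are illustrative assumptions, not the researchers' actual code.

```python
# Minimal sketch of a quiz-style benchmark harness (hypothetical, not the
# paper's implementation). Each item is a (question, gold_answer) pair and
# the model is any callable mapping a prompt string to a response string.
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Topeka!' matches 'topeka'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def score(model: Callable[[str], str],
          items: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of riddles answered with an exact normalized match."""
    results = []
    for question, gold in items:
        prompt = f"Solve this riddle and reply with the answer only:\n{question}"
        results.append(normalize(model(prompt)) == normalize(gold))
    return sum(results) / len(results)


if __name__ == "__main__":
    # Hypothetical stand-in for a real model API call.
    def toy_model(prompt: str) -> str:
        return "I give up"  # mimics the R1 behavior described above

    demo_items = [("Name a U.S. state capital that ...", "Topeka")]
    print(f"accuracy: {score(toy_model, demo_items):.0%}")  # accuracy: 0%
```

Swapping the stub for a real API client and the demo item for the ~600 riddles would reproduce the overall shape of the experiment; grading short free-form answers (rather than multiple choice) is what keeps a quiz like this resistant to rote recall.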