
AI isn’t very good at history, new paper finds


AI may excel at certain tasks such as coding or producing podcasts, but it struggles to pass a high-level history exam, a new paper has found. A team of researchers created a benchmark to test three top large language models (LLMs) – OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini – on historical questions. The benchmark, Hist-LLM, checks the correctness of answers against the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.

The results, presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute in Austria. The best-performing LLM was GPT-4 Turbo, but it achieved only about 46% accuracy – not much higher than random guessing.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared sample historical questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs bad at answering technical historical questions when they can be so good at answering complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to extrapolate from historical data that is very prominent, finding it difficult to retrieve more obscure historical knowledge. For example, the researchers asked GPT-4 whether ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM answered incorrectly. This is likely because there is lots of public information about other ancient empires, like Persia, having standing armies.

“If you get told A and B 100 times, and C one time, and then are asked a question about C, you might just remember A and B and try to extrapolate from that,” said del Rio-Chanona.

The researchers also identified other trends, including that the OpenAI and Llama models performed worse for certain regions such as sub-Saharan Africa, suggesting potential biases in their training data.

The results show that LLMs still aren’t a substitute for humans when it comes to certain domains, said Peter Turchin, who led the study and is a faculty member at CSH. But the researchers remain hopeful that LLMs can help historians in the future. They are working on refining the benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid in historical research,” the paper reads.
