On April 2nd local time, US-based AI research company OpenAI announced the launch of PaperBench, a benchmark for evaluating the ability of AI agents to reproduce cutting-edge AI research. To succeed, an agent must reproduce 20 ICML 2024 Spotlight and Oral papers from scratch, which involves understanding each paper's contributions, building a codebase, and successfully executing the experiments. After testing several frontier models on PaperBench, OpenAI found that the best-performing agent, Claude 3.5 Sonnet (new version) combined with an open-source agent framework, achieved an average reproduction score of 21.0%. OpenAI also recruited top machine learning PhDs to attempt a subset of the benchmark and found that none of the models tested has yet surpassed the human baseline.