The new ARC-AGI-2 test is presented as a significant challenge for AI models, testing their true adaptability and efficiency.
Why is ARC-AGI-2 Tougher for AI Models?
According to the Arc Prize Foundation's blog, reasoning models like OpenAI's o1-pro and DeepSeek's R1 score only between 1% and 1.3% on ARC-AGI-2. Non-reasoning models, including GPT-4.5 and Claude 3.7 Sonnet, also hover around 1%. Humans, by contrast, average 60% accuracy.
The test includes:
- Visual puzzle tasks that require generating the correct output grid from demonstrated patterns (see the sketch of a task's structure below).
- Adaptability requirements: every problem is novel, so models cannot fall back on tasks encountered during training.
- An efficiency metric that evaluates not just whether a model reaches the right answer, but the cost of the path it takes to get there.
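ARC-AGI tasks are distributed as JSON: each task contains a handful of demonstration input/output grid pairs plus test inputs, and the solver must infer the underlying transformation and produce the test outputs. The toy task and solve function below are a minimal, hypothetical illustration, assuming ARC-AGI-2 keeps the grid-based JSON layout publicly documented for ARC-AGI-1.

```python
import json

# A minimal ARC-style task in the publicly documented JSON layout:
# "train" holds demonstration pairs, "test" holds inputs to solve.
# Grids are 2D lists of integers 0-9, each integer denoting a color.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]}  # the model must produce the output grid
    ],
}

def solve(grid: list[list[int]]) -> list[list[int]]:
    """Hypothetical solver for this toy task: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

# Check that the rule inferred from the demonstrations holds on all of them.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(json.dumps(solve(task["test"][0]["input"])))  # [[0, 3], [3, 0]]
```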
ARC-AGI-2 vs. ARC-AGI-1: Key Changes
François Chollet, co-founder of the Arc Prize Foundation, stated that ARC-AGI-2 measures a model's true intelligence better than its predecessor, ARC-AGI-1. The key difference is a shift away from rewarding brute-force computation and toward genuine adaptability and efficiency. OpenAI's o3 (low), for example, scored highly on ARC-AGI-1 but dropped sharply on ARC-AGI-2.
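The blog post does not publish a formula for efficiency, but the ARC Prize leaderboard pairs each score with a cost-per-task figure, which lets a score achieved by expensive brute-force search be distinguished from one reached efficiently. The sketch below shows one way such a comparison could be tabulated; the model names and figures are entirely hypothetical placeholders.

```python
# Efficiency-aware comparison sketch: report accuracy alongside cost per task,
# so cheap, targeted reasoning is visible next to costly brute-force search.
# All names and figures below are hypothetical, not real leaderboard data.
results = {
    "model_a": {"solved": 45, "attempted": 120, "total_cost_usd": 3600.0},
    "model_b": {"solved": 40, "attempted": 120, "total_cost_usd": 90.0},
}

for name, r in results.items():
    accuracy = r["solved"] / r["attempted"]
    cost_per_task = r["total_cost_usd"] / r["attempted"]
    print(f"{name}: accuracy={accuracy:.1%}, cost/task=${cost_per_task:.2f}")
```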
Importance of the New AI Benchmark
The ARC-AGI-2 test arrives at a timely moment: there has been growing demand for more credible and novel benchmarks for evaluating AI progress, particularly in creative problem-solving. ARC-AGI-2 addresses these shortcomings, contributing to a more accurate assessment of AI's capabilities.
The ARC-AGI-2 test marks a significant step forward in measuring progress toward artificial general intelligence, underscoring the challenges AI models still face in reaching human-level intelligence.