ARC-AGI-3: AI's Limitations Revealed

ARC-AGI-3 launched last week and it showed us where the real risk lies as agentic intelligence develops. It was designed to evaluate agentic intelligence through interactive reasoning environments.

Here's what caught my attention: 100% of the tasks are solvable by humans on first contact, with no prior training or instruction. On the same tasks, every frontier language model - GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro - currently scores under 1% out of the box. The best optimised AI approach managed about 12%, using reinforcement learning rather than language models.

We are surrounded by breathless claims about artificial general intelligence being just around the corner. Models are acing standardised tests, passing bar exams, writing publishable research. And yet, when you present them with genuinely novel interactive problems - problems that any human can solve the first time they see them - they fail almost completely.

This is not a criticism of the models. They are extraordinarily capable at what they've been trained on. But it is a reminder that if you're deploying AI agents in your organisation, you need to understand what they can and can't do. They will perform brilliantly within their training distribution. They will struggle - sometimes catastrophically - outside it.

A few weeks ago I posted about Claude finding creative workarounds to benchmarks and compliance boundaries. ARC-AGI-3 shows the other side of the coin: these systems can be simultaneously creative within certain domains and completely lost in others. Understanding the boundary between those two conditions is one of the most important challenges in enterprise AI deployment today.

Don't believe anyone who tells you AGI is imminent. And don't believe anyone who tells you current AI isn't transformative. Both claims miss the point. The real question is: do you understand where your AI systems are capable and where they're fragile? Because that boundary is where the risk lives.

Link to more on ARC-AGI-3 in the comments.

Completely agree, Daniel. Reflections like this are especially needed from AI leadership within large organisations. We're surrounded by catastrophist narratives. Elon Musk talks about an 80% chance AI will make human work unnecessary. Kai-Fu Lee predicted 50% of jobs replaced in just 3 years. Yet McKinsey shows only 5% of jobs can be fully automated today, and ARC-AGI-3 proves AI scores under 1% on tasks any human solves on first contact.

This fear narrative is not harmless. I've seen it firsthand: when people see AI as a direct threat to their jobs, they resist, slow down or even sabotage implementations. The real risk isn't AI replacing anyone — it's your organisation falling behind because it failed to manage the message.

We need to reframe the conversation. AI enhances our work and professional profile. Where it replaces, it takes over repetitive, low-value tasks — the ones nobody enjoys — freeing time to keep developing new skills. The question isn't "will AI replace me?" but "how can I use it to be better at what I do?" Those leading the AI conversation have a responsibility to do so with honesty and data, not alarmist headlines.

Cheryl Dean

MBN Solutions · 19K followers

3w

That under 1% figure stops me every time I read it. Humans, first attempt, no training. Every frontier model, under 1%. I work in AI recruitment and this is exactly the conversation I have with hiring managers who are trying to write job specs for AI roles. They're describing what the model does well, not where it breaks. And then they're surprised when the person they hire can't manage the edge cases. Understanding where your AI is capable and where it isn't isn't just a deployment question. It's a hiring question. The people you need around these systems are the ones who can read that boundary and make the call when the model hits it. What does that person look like to you? Because I don't think we've agreed on a job title for them yet.

Kris Shergold

I've spent 25+ years at… · 5K followers

3w

There’s something that doesn’t chime right here, Daniel Hulme. If AI is creative in some domains, that should translate. Creativity is not domain specific. Subject mastery is domain specific, and is one foundation on which creativity can be built. Cross-domain pattern recognition is another, not domain specific by definition. I had a cursory (AI) read of the paper looking for a definition of creativity and found something more akin to a measure of adaptive efficiency on novel tasks, which looks more like domain mastery and intra-domain pattern recognition than creativity. Or perhaps performance in certain domains reflects training data it shouldn’t have had access to. Which would make this less a creativity story and more a data provenance one.

Rand Nezha

SheTech · 3K followers

3w

Like any technological wave, real transformative adoption has never been just about tools and technology. It moves only as fast as the change management, process design, organisational legacy, CXO focus and sponsorship, underlying enterprise tech stack, communications, hiring, organisational culture, training, and the list goes on...

Samran Elahi

Rezunate AI · 2K followers

3w

Daniel, the best optimized approach scoring 12% using reinforcement learning rather than language models tells you something important about what current LLM architectures are actually good at versus what we assume they're good at. Pattern matching within training distributions is not the same as reasoning through genuinely novel interactive problems. ARC-AGI-3 makes that distinction impossible to ignore.

Aleem Jamil

Machine Learning 1 Limited · 7K followers

4d

Strong analysis. This really highlights the gap between benchmark performance and true interactive generalization.

Bohdan Dovzhnyi

Creative marketing concepts… · 4K followers

3w

That distinction is so important - performing well inside the training distribution vs failing on novel problems is exactly what people building real automations keep bumping into. The benchmark scores look great until the edge cases start piling up.

Irina Kozerog

Self-employed · 4K followers

3w

For enterprise leaders, the risk isn't what the AI knows - it's what it doesn't know it doesn't know. Strategic share!

Calum Chace

Conscium · 8K followers

2w

The jagged edge still cuts deep.
