ARC-AGI-3: AI's Limitations Revealed

ARC-AGI-3 launched last week and it showed us where the real risk lies as agentic intelligence develops. It was designed to evaluate agentic intelligence through interactive reasoning environments.

Here's what caught my attention: 100% of the tasks are solvable by humans on first contact, with no prior training or instruction. On the same tasks, every frontier language model - GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro - currently scores under 1% out of the box. The best optimised AI approach managed about 12%, using reinforcement learning rather than language models.

We are surrounded by breathless claims about artificial general intelligence being just around the corner. Models are acing standardised tests, passing bar exams, writing publishable research. And yet, when you present them with genuinely novel interactive problems - problems that any human can solve the first time they see them - they fail almost completely.

This is not a criticism of the models. They are extraordinarily capable at what they've been trained on. But it is a reminder that if you're deploying AI agents in your organisation, you need to understand what they can and can't do. They will perform brilliantly within their training distribution. They will struggle - sometimes catastrophically - outside it.

A few weeks ago I posted about Claude finding creative workarounds to benchmarks and compliance boundaries. ARC-AGI-3 shows the other side of the coin: these systems can be simultaneously creative within certain domains and completely lost in others. Understanding the boundary between those two conditions is one of the most important challenges in enterprise AI deployment today.

Don't believe anyone who tells you AGI is imminent. And don't believe anyone who tells you current AI isn't transformative. Both claims miss the point. The real question is: do you understand where your AI systems are capable and where they're fragile? Because that boundary is where the risk lives.

Link to more on ARC-AGI-3 in the comments.

Completely agree, Daniel. Reflections like this are especially needed from AI leadership within large organisations. We're surrounded by catastrophist narratives. Elon Musk talks about an 80% chance AI will make human work unnecessary. Kai-Fu Lee predicted 50% of jobs replaced in just 3 years. Yet McKinsey shows only 5% of jobs can be fully automated today, and ARC-AGI-3 proves AI scores under 1% on tasks any human solves on first contact.

This fear narrative is not harmless. I've seen it firsthand: when people see AI as a direct threat to their jobs, they resist, slow down or even sabotage implementations. The real risk isn't AI replacing anyone — it's your organisation falling behind because it failed to manage the message.

We need to reframe the conversation. AI enhances our work and professional profile. Where it replaces, it takes over repetitive, low-value tasks — the ones nobody enjoys — freeing time to keep developing new skills. The question isn't "will AI replace me?" but "how can I use it to be better at what I do?" Those leading the AI conversation have a responsibility to do so with honesty and data, not alarmist headlines.

Cheryl Dean

MBN Solutions · 19K followers

3w

That under 1% figure stops me every time I read it. Humans, first attempt, no training. Every frontier model, under 1%. I work in AI recruitment and this is exactly the conversation I have with hiring managers who are trying to write job specs for AI roles. They're describing what the model does well, not where it breaks. And then they're surprised when the person they hire can't manage the edge cases. Understanding where your AI is capable and where it isn't isn't just a deployment question. It's a hiring question. The people you need around these systems are the ones who can read that boundary and make the call when the model hits it. What does that person look like to you? Because I don't think we've agreed on a job title for them yet.

Kris Shergold

I've spent 25+ years at… · 5K followers

3w

There’s something that doesn’t chime right here, Daniel Hulme. If AI is creative in some domains, that should translate. Creativity is not domain specific. Subject mastery is domain specific, and is one foundation on which creativity can be built. Cross-domain pattern recognition is another, not domain specific by definition. I had a cursory (AI) read of the paper looking for a definition of creativity and found something more akin to a measure of adaptive efficiency on novel tasks, which looks more like domain mastery and intra-domain pattern recognition than creativity. Or perhaps performance in certain domains reflects training data it shouldn’t have had access to. Which would make this less a creativity story and more a data provenance one.

Rand Nezha

SheTech · 3K followers

3w

Like any technological wave, real transformative adoption has never been just about tools and technology. It moves only as fast as the change management, process design, organisational legacy, CXO focus and sponsorship, underlying enterprise tech stack, communications, hiring, organisational culture, training, and the list goes on...

Samran Elahi

Rezunate AI · 2K followers

3w

Daniel, the best optimized approach scoring 12% using reinforcement learning rather than language models tells you something important about what current LLM architectures are actually good at versus what we assume they're good at. Pattern matching within training distributions is not the same as reasoning through genuinely novel interactive problems. ARC-AGI-3 makes that distinction impossible to ignore.

Aleem Jamil

Machine Learning 1 Limited · 7K followers

4d

Strong analysis. This really highlights the gap between benchmark performance and true interactive generalization.

Bohdan Dovzhnyi

Creative marketing concepts… · 4K followers

3w

That distinction is so important - performing well inside the training distribution vs failing on novel problems is exactly what people building real automations keep bumping into. The benchmark scores look great until the edge cases start piling up.

Irina Kozerog

Self-employed · 4K followers

3w

For enterprise leaders, the risk isn't what the AI knows - it's what it doesn't know it doesn't know. Strategic share!

Calum Chace

Conscium · 8K followers

2w

The jagged edge still cuts deep.
