Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool Use, Planning, and Multi-agent Collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output.

Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains.

You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

"Here's code intended for task X: [previously generated code]. Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it."

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements. This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks, including producing code, writing text, and answering questions.
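The generate/critique/rewrite loop described above can be sketched in a few lines. This is a minimal illustration, assuming a generic `call_llm` placeholder for whatever chat-completion client you use; the prompts are paraphrased from the post, and the number of `rounds` is an arbitrary knob.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g., an OpenAI or Anthropic client)."""
    raise NotImplementedError


def reflect_and_rewrite(task: str, llm=call_llm, rounds: int = 2) -> str:
    # Step 1: generate a first draft directly.
    output = llm(f"Write code to accomplish the following task:\n{task}")
    for _ in range(rounds):
        # Step 2: ask the model to criticize its own output.
        critique = llm(
            f"Here's code intended for task: {task}\n\n{output}\n\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        # Step 3: rewrite using the previous draft plus the feedback.
        output = llm(
            f"Task: {task}\n\nPrevious code:\n{output}\n\n"
            f"Feedback:\n{critique}\n\nRewrite the code using this feedback."
        )
    return output
```

With `rounds=2` this makes five LLM calls in total: one initial generation plus two critique/rewrite pairs.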
And we can go beyond self-reflection by giving the LLM tools that help it evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications' results. If you're interested in learning more about Reflection, I recommend:

- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
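The tool-assisted variant described above, running generated code against unit tests and feeding failures back as the critique, might look like the sketch below. Note the hedges: `run_tests` uses `exec`, which is fine for a demo but would need sandboxing in production, and all names here are illustrative assumptions rather than any library's API.

```python
import traceback


def run_tests(code: str, tests: list) -> list:
    """Execute generated code, then collect messages from any failing checks."""
    failures = []
    namespace = {}
    try:
        exec(code, namespace)  # caution: run untrusted code only in a sandbox
        for test in tests:
            try:
                test(namespace)
            except AssertionError as e:
                failures.append(f"Test failed: {e}")
    except Exception:
        failures.append(traceback.format_exc())
    return failures


def reflect_with_tests(task: str, tests: list, llm, max_iters: int = 3) -> str:
    code = llm(f"Write Python code for: {task}")
    for _ in range(max_iters):
        failures = run_tests(code, tests)
        if not failures:
            break  # all tests pass; stop reflecting
        # Feed concrete test failures back as the "critique".
        code = llm(
            f"Task: {task}\nCode:\n{code}\nFailures:\n"
            + "\n".join(failures)
            + "\nFix the code so all tests pass."
        )
    return code
```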
How to Improve Agent Performance With LLMs
Summary
Improving agent performance with large language models (LLMs) means helping AI-powered agents work smarter and more reliably on tasks like research, coding, and customer support. LLMs are advanced artificial intelligence models that understand and generate human-like text, and when used thoughtfully, they can help agents reflect, plan, and collaborate to deliver better results.
- Apply thoughtful prompting: Experiment with various prompting techniques—such as breaking tasks into steps or encouraging reflection—to help LLM-based agents deliver more accurate and reliable responses.
- Choose the right model: Match your AI agent’s model to your specific needs, considering both the task and platform, so the agent performs well whether analyzing documents or running on mobile devices.
- Build agentic workflows: Design agents to use memory, connect to external tools, and work in teams, enabling them to handle complex, real-world tasks beyond simple question answering.
In the last three months alone, over ten papers introducing novel prompting techniques were published, each boosting LLMs' performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google's Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques.

Yet, while our X and LinkedIn feeds buzz with 'secret prompting tips', a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results.

In this AI Tidbits Deep Dive, I outline six of the best recent prompting methods:

(1) EmotionPrompt - inspired by human psychology, this method utilizes emotional stimuli in prompts to gain performance enhancements
(2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the "Take a deep breath" instruction that improved LLMs' performance by 9%.
(3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy
(4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details prior to querying the LLM
(5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning
(6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy

Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential.

Full blog post: https://lnkd.in/g7_6eP6y
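To make one of these concrete, here is a rough sketch of the four-step Chain-of-Verification flow from (3): draft, plan verification questions, answer them independently, then revise. The prompt wording is my own paraphrase, not Meta's, and `llm` is a placeholder for any completion function.

```python
def chain_of_verification(question: str, llm) -> str:
    # 1. Draft an initial (possibly hallucination-prone) answer.
    baseline = llm(f"Answer the question: {question}")
    # 2. Plan verification questions that probe the draft's claims.
    plan = llm(
        f"Question: {question}\nDraft answer: {baseline}\n"
        "List verification questions (one per line) to fact-check this draft."
    )
    # 3. Answer each verification question independently of the draft,
    #    so earlier mistakes don't contaminate the checks.
    checks = [llm(q) for q in plan.splitlines() if q.strip()]
    # 4. Produce a final answer revised in light of the checks.
    return llm(
        f"Question: {question}\nDraft: {baseline}\n"
        "Verification answers:\n" + "\n".join(checks) +
        "\nWrite a final, corrected answer."
    )
```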
-
Choosing the right LLM for your AI agent isn't about selecting the most powerful model. It's about finding the right capabilities for your specific use case and limitations. Different tasks require different strengths, whether that's reasoning through complex documents, conducting real-time research, or working efficiently on mobile devices. Understanding these key AI agent patterns helps you choose models that perform best for your actual needs instead of just impressive benchmarks.

Here's how to match LLMs to your specific AI agent needs:

🔹 Web Browsing & Research Agents: You need models that are good at gathering information and market insights in real time. GPT-4o with browsing capabilities, Perplexity API, and Gemini 1.5 Pro with API access work well because they can quickly process live web data and gather findings from various sources.

🔹 Document Analysis & RAG Systems: For contract analysis, legal research, and customer support bots, look for models that excel at understanding context from retrieved documents. GPT-4o, Claude 3 Sonnet, Llama 3 fine-tuned versions, and Mistral with RAG pipelines handle long documents effectively.

🔹 Coding & Development Assistants: Automatic code generation and debugging need models trained specifically for programming tasks. GPT-4o, Claude 3 Opus, StarCoder2, and CodeLlama 70B understand code structure, troubleshoot issues, and explain complex programming concepts better than general models.

🔹 Specialized Domain Applications: Medical assistants, legal co-pilots, and enterprise Q&A bots benefit from specialized fine-tuning. Llama 3, Mistral fine-tuned versions, and Gemma 2B are most effective when customized for specific industries, regulations, and technical terms.

Match your model choice to your deployment constraints. Cloud-based agents can use powerful models like GPT-4o and Claude, while edge devices need efficient options like Mistral 7B or TinyLlama. Start with general-purpose models for prototyping. Then optimize with specialized or fine-tuned versions once you know your specific performance needs.

#llm #aiagents
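One way to operationalize this matching advice is a small routing table keyed by task category and deployment target. The model names below mirror those mentioned in the post, but the table layout, the "first candidate wins" rule, and the category names are illustrative assumptions, not benchmark results.

```python
# Candidate models per task category, roughly following the post's groupings.
ROUTING_TABLE = {
    "research": ["gpt-4o-browsing", "perplexity-api", "gemini-1.5-pro"],
    "rag":      ["gpt-4o", "claude-3-sonnet", "llama-3-finetuned"],
    "coding":   ["gpt-4o", "claude-3-opus", "starcoder2", "codellama-70b"],
    "edge":     ["mistral-7b", "tinyllama"],
}


def pick_model(task_type: str, deployment: str = "cloud") -> str:
    """Pick a model by task category; edge deployments override everything."""
    if deployment == "edge":
        return ROUTING_TABLE["edge"][0]
    candidates = ROUTING_TABLE.get(task_type)
    if not candidates:
        raise ValueError(f"unknown task type: {task_type}")
    return candidates[0]
```

In a real system the table would also encode cost, latency, and context-window limits, but even this toy version makes the model choice explicit and testable.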
-
Agentic AI Design Patterns are emerging as the backbone of real-world, production-grade AI systems, and this is gold from Andrew Ng.

Most current LLM applications are linear: prompt → output. But real-world autonomy demands more. It requires agents that can reflect, adapt, plan, and collaborate over extended tasks and in dynamic environments. That's where the RTPM framework comes in. It's a design blueprint for building scalable agentic systems:

➡️ Reflection
➡️ Tool-Use
➡️ Planning
➡️ Multi-Agent Collaboration

Let's unpack each one from a systems engineering perspective:

🔁 1. Reflection
This is the agent's ability to perform self-evaluation after each action. It's not just post-hoc logging; it's part of the control loop. Agents ask:
→ Was the subtask successful?
→ Did the tool/API return the expected structure or value?
→ Is the plan still valid given the current memory state?
Techniques include:
→ Internal scoring functions
→ Critic models trained on trajectory outcomes
→ Reasoning chains that validate step outputs
Without reflection, agents remain brittle; with it, they become self-correcting systems.

🛠 2. Tool-Use
LLMs alone can't interface with the world. Tool-use enables agents to execute code, perform retrieval, query databases, call APIs, and trigger external workflows. Tool-use design involves:
→ Function calling or JSON schema execution (OpenAI, Fireworks AI, LangChain, etc.)
→ Grounding outputs into structured results (e.g., SQL, Python, REST)
→ Chaining results into subsequent reasoning steps
This is how you move from "text generators" to capability-driven agents.

📊 3. Planning
Planning is the core of long-horizon task execution. Agents must:
→ Decompose high-level goals into atomic steps
→ Sequence tasks based on constraints and dependencies
→ Update plans reactively when intermediate states deviate
Design patterns here include:
→ Chain-of-thought with memory rehydration
→ Execution DAGs or LangGraph flows
→ Priority queues and re-entrant agents
Planning separates short-term LLM chains from persistent agentic workflows.

🤖 4. Multi-Agent Collaboration
As task complexity grows, specialization becomes essential. Multi-agent systems allow modularity, separation of concerns, and distributed execution. This involves:
→ Specialized agents: planner, retriever, executor, validator
→ Communication protocols: Model Context Protocol (MCP), A2A messaging
→ Shared context: via centralized memory, vector DBs, or message buses
This mirrors multi-threaded systems in software, except now the "threads" are intelligent and autonomous.

Agentic design ≠ monolithic LLM chains. It's about constructing layered systems with runtime feedback, external execution, memory-aware planning, and collaborative autonomy.

Here is a deep-dive blog if you would like to learn more: https://lnkd.in/dKhi_n7M
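The Planning pattern's "decompose into steps, sequence by dependency, then execute" idea maps directly onto a topological sort, which Python's standard library ships as `graphlib`. The toy task graph below is a hypothetical example, not part of any framework.

```python
from graphlib import TopologicalSorter


def execute_plan(steps: dict, run_step) -> list:
    """Run steps in dependency order.

    `steps` maps each step name to the set of steps it depends on
    (an execution DAG); `run_step` executes a single step.
    """
    order = list(TopologicalSorter(steps).static_order())
    return [run_step(step) for step in order]
```

A reactive planner would additionally re-run the sort when a step fails or invalidates downstream work; this sketch only covers the happy path.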
-
We’re entering an era where AI isn’t just answering questions; it’s starting to take action. From booking meetings to writing reports to managing systems, AI agents are slowly becoming the digital coworkers of tomorrow!

But building an AI agent that’s actually helpful, and scalable, is a whole different challenge. That’s why I created this 10-step roadmap for building scalable AI agents (2025 Edition): to break it down clearly and practically. Here’s what it covers and why it matters:

- Start with the right model: Don’t just pick the most powerful LLM. Choose one that fits your use case: stable responses, good reasoning, and support for tools and APIs.
- Teach the agent how to think: Should it act quickly or pause and plan? Should it break tasks into steps? These choices define how reliable your agent will be.
- Write clear instructions: Just like onboarding a new hire, agents need structured guidance. Define the format, tone, when to use tools, and what to do if something fails.
- Give it memory: AI models forget fast. Add memory so your agent remembers what happened in past conversations, knows user preferences, and keeps improving.
- Connect it to real tools: Want your agent to actually do something? Plug it into tools like CRMs, databases, or email. Otherwise, it’s just chat.
- Assign one clear job: Vague tasks like “be helpful” lead to messy results. Clear tasks like “summarize user feedback and suggest improvements” lead to real impact.
- Use agent teams: Sometimes, one agent isn’t enough. Use multiple agents with different roles: one gathers info, another interprets it, another delivers output.
- Monitor and improve: Watch how your agent performs, gather feedback, and tweak as needed. This is how you go from a working demo to something production-ready.
- Test and version everything: Just like software, agents evolve. Track what works, test different versions, and always have a backup plan.
- Deploy and scale smartly: From APIs to autoscaling, once your agent works, make sure it can scale without breaking.

Why this matters: The AI agent space is moving fast. Companies are using them to improve support, sales, internal workflows, and much more. If you work in tech, data, product, or operations, learning how to build and use agents is quickly becoming a must-have skill. This roadmap is a great place to start or to benchmark your current approach.

What step are you on right now?
-
𝐈 𝐡𝐚𝐯𝐞 𝐬𝐩𝐞𝐧𝐭 𝐭𝐡𝐞 𝐥𝐚𝐬𝐭 𝐲𝐞𝐚𝐫 𝐡𝐞𝐥𝐩𝐢𝐧𝐠 𝐄𝐧𝐭𝐞𝐫𝐩𝐫𝐢𝐬𝐞𝐬 𝐦𝐨𝐯𝐞 𝐟𝐫𝐨𝐦 "𝐈𝐌𝐏𝐑𝐄𝐒𝐒𝐈𝐕𝐄 𝐃𝐄𝐌𝐎𝐒" 𝐭𝐨 "𝐑𝐄𝐋𝐈𝐀𝐁𝐋𝐄 𝐀𝐈 𝐀𝐆𝐄𝐍𝐓𝐒".

The pattern is always the same: teams nail the LLM integration and think the hard part is done, then realize they have built 20% of what production actually requires.

𝐇𝐞𝐫𝐞 𝐢𝐬 𝐰𝐡𝐲 𝐞𝐚𝐜𝐡 𝐛𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐛𝐥𝐨𝐜𝐤 𝐦𝐚𝐭𝐭𝐞𝐫𝐬:

Reasoning Engine (LLM): Just the Beginning
• Interprets intent and generates responses
• Without surrounding infrastructure, it is just expensive autocomplete
• Real engineering starts when you ask: "How does this agent make decisions it can defend?"

Context Assembly: Your Competitive Moat
• Where RAG, memory stores, and knowledge retrieval converge
• Identical LLMs produce vastly different results based purely on context quality
• Prompt engineering does not matter if you are feeding the model irrelevant information

Planning Layer: What to Do Next
• Breaks goals into steps and decides actions before acting
• Separates thinking from doing
• Poor planning = agents that thrash or make circular progress

Guardrails & Policy Engine: Non-Negotiable
• Defines what APIs the agent can call and what data it can access
• Determines which decisions require human approval
• One misconfigured tool call can cascade into serious business impact

Memory Store: Enables Continuity
• Short-term state + long-term memory across interactions
• Without it, every conversation starts from zero
• A context window isn't memory; it's just a scratchpad

Validation & Feedback Loop: How Agents Improve
• Logging isn't learning
• Capture user corrections, edge cases, and quality signals
• The best teams treat every interaction as potential training data

Observability: Makes the Invisible Visible
• When your agent fails, can you trace exactly why?
• Which context was retrieved? What reasoning path? What was the token cost?
• If you cannot answer in under 60 seconds, debugging will kill velocity

Cost & Performance Controls: POC vs. Product
• Intelligent model routing, caching, and token optimization are not premature; they are survival
• Monthly bills can drop 70% with zero accuracy loss through smarter routing

What most teams miss: they build top-down (UI → LLM → tools) when they should build bottom-up (infrastructure → observability → guardrails → reasoning).

These building blocks are not theoretical. They are what every production agent eventually requires, either through intentional design or painful iteration.

𝐖𝐡𝐢𝐜𝐡 𝐛𝐥𝐨𝐜𝐤 𝐚𝐫𝐞 𝐲𝐨𝐮 𝐜𝐮𝐫𝐫𝐞𝐧𝐭𝐥𝐲 𝐮𝐧𝐝𝐞𝐫𝐢𝐧𝐯𝐞𝐬𝐭𝐢𝐧𝐠 𝐢𝐧?

#GenAI #AIAgents
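As one concrete example, the Guardrails & Policy Engine block can start life as a small allow-list layer that every tool call passes through. The tool names, decision strings, and two-tier policy below are illustrative assumptions, not a standard API.

```python
class PolicyEngine:
    """Toy guardrail layer: allow-list tools, flag some for human approval."""

    def __init__(self, allowed_tools, approval_required=()):
        self.allowed_tools = set(allowed_tools)
        self.approval_required = set(approval_required)

    def check(self, tool_name: str) -> str:
        """Return 'deny', 'needs_human_approval', or 'allow' for a tool call."""
        if tool_name not in self.allowed_tools:
            return "deny"  # not on the allow-list at all
        if tool_name in self.approval_required:
            return "needs_human_approval"  # allowed, but gated
        return "allow"
```

Routing every tool invocation through a single choke point like this is also what makes the Observability block tractable: one place to log which call was attempted and why it was allowed or blocked.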
-
What if your smartest AI model could explain the right move, but still made the wrong one?

A recent paper from Google DeepMind makes a compelling case: if we want LLMs to act as intelligent agents (not just explainers), we need to fundamentally rethink how we train them for decision-making.

➡ The challenge: LLMs underperform in interactive settings like games or real-world tasks that require exploration. The paper identifies three key failure modes:
🔹 Greediness: Models exploit early rewards and stop exploring.
🔹 Frequency bias: They copy the most common actions, even if they are bad.
🔹 The knowing-doing gap: 87% of their rationales are correct, but only 21% of actions are optimal.

➡ The proposed solution: Reinforcement Learning Fine-Tuning (RLFT) using the model's own Chain-of-Thought (CoT) rationales as a basis for reward signals. Instead of fine-tuning on static expert trajectories, the model learns from interacting with environments like bandits and Tic-tac-toe.

Key takeaways:
🔹 RLFT improves action diversity and reduces regret in bandit environments.
🔹 It significantly counters frequency bias and promotes more balanced exploration.
🔹 In Tic-tac-toe, RLFT boosts win rates from 15% to 75% against a random agent and holds its own against an MCTS baseline.

Link to the paper: https://lnkd.in/daK77kZ8

If you are working on LLM agents or autonomous decision-making systems, this is essential reading.

#artificialintelligence #machinelearning #llms #reinforcementlearning #technology
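The "greediness" failure mode has a classic toy illustration: on a two-armed bandit, a purely greedy policy can lock onto the first arm it tries, while epsilon-greedy keeps exploring and finds the better arm. This is a didactic sketch of the failure mode only, not the paper's RLFT method; the arm probabilities, step count, and seed are arbitrary.

```python
import random


def play_bandit(policy, arm_probs, steps=500, seed=0):
    """Run a policy on a Bernoulli bandit; return total reward collected."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    values = [0.0] * len(arm_probs)  # running mean reward per arm
    total = 0
    for _ in range(steps):
        arm = policy(rng, counts, values)
        counts[arm] += 1
        reward = 1 if rng.random() < arm_probs[arm] else 0
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total


def greedy(rng, counts, values):
    # Always exploit the current best estimate (ties go to the first arm),
    # so the weaker arm 0 is never abandoned once chosen.
    return max(range(len(values)), key=lambda a: values[a])


def eps_greedy(rng, counts, values, eps=0.1):
    # Explore a random arm with probability eps, otherwise exploit.
    if rng.random() < eps:
        return rng.randrange(len(values))
    return greedy(rng, counts, values)
```

With arms paying off at 20% and 80%, the greedy policy starts on arm 0 and never gathers evidence about arm 1, while epsilon-greedy discovers and exploits the 80% arm.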
-
Regarding the future of AI agents: why "context is king" just got more powerful.

As someone building with agentic AI, I have been exploring a breakthrough approach that is transforming our understanding of LLM performance: Agentic Context Engineering (ACE). Here's what I found noteworthy.

The problem: Traditional prompt optimization often suffers from 'brevity bias,' which compresses away essential domain-specific insights that agents require. Additionally, iterative refinement can lead to 'context collapse,' where detailed knowledge diminishes over time.

The ACE solution: Rather than treating contexts as brief summaries, ACE constructs them as comprehensive, evolving playbooks:
- +10.6% improvement on agent benchmarks
- +8.6% boost on domain-specific tasks
- 86.9% lower adaptation latency
- Functions without labeled data, learning from execution feedback
- Incremental delta updates, modifying only what is necessary instead of regenerating entire contexts

Why this matters for builders: The standout result? On the AppWorld leaderboard, ACE using DeepSeek-V3 (open-source) matched the top production-level GPT-4 agent. This is not merely academic; it is about making powerful agentic systems more accessible and cost-effective.

The three-component architecture (Generator → Reflector → Curator) reflects how humans learn: experiment, reflect, and consolidate knowledge. However, unlike humans, LLMs excel with detailed, comprehensive contexts.

Key insight: As long-context models improve and KV cache optimization progresses, the bottleneck is not context length but context quality. ACE demonstrates how to create contexts that preserve institutional knowledge, domain expertise, and proven strategies.

For those developing AI agents or compound AI systems, this research indicates a clear direction: prioritize your context engineering as much as your model selection. What has been your experience with context optimization? Are you observing similar trends in your AI systems?

#AI #AgenticAI #LLMs #MachineLearning #Innovation #AIEngineering
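The Generator → Reflector → Curator loop described above can be expressed as a tiny scaffold. This is only a sketch of the shape of the idea, assuming the three components are plain callables; it is not the ACE authors' actual API, and the delta-update rule here is deliberately naive.

```python
def ace_step(playbook: list, task, generator, reflector, curator) -> list:
    """One iteration of a Generator -> Reflector -> Curator context loop."""
    attempt = generator(task, playbook)   # act using the current playbook
    lesson = reflector(task, attempt)     # extract a lesson from the outcome
    deltas = curator(playbook, lesson)    # decide which small entries to add
    # Key property: apply incremental deltas, never regenerate the whole
    # playbook, so accumulated knowledge is not compressed away.
    return playbook + deltas
```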
-
LangChain recently published a helpful step-by-step guide on building AI agents.

🔗 How to Build an Agent – https://lnkd.in/dKKjw6Ju

It covers key phases:
1. Define realistic tasks
2. Document a standard operating procedure
3. Build an MVP with prompt engineering
4. Connect & orchestrate
5. Test & iterate
6. Deploy, scale, and refine

While the structure is solid, one important dimension that's often overlooked in agent design is efficiency at scale. This is where Lean Agentic AI becomes critical: focusing on managing cost, carbon, and complexity from the very beginning. Let's take a few examples from the blog and view them through a lean lens:

🔍 Task Definition
➡️ If the goal is to extract structured data from invoices, a lightweight OCR + regex or deterministic parser may outperform a full LLM agent in both speed and emissions.
Lean principle: Use agents only when dynamic reasoning is truly required; avoid using LLMs for tasks better handled by existing rule-based or heuristic methods.

📋 Operating Procedures
➡️ For a customer support agent, identify which inquiries require LLM reasoning (e.g., nuanced refund requests) and which can be resolved using static knowledge bases or templates.
Lean principle: Separate deterministic steps from open-ended reasoning early to reduce unnecessary model calls.

🤖 Prompt MVP
➡️ For a lead qualification agent, use a smaller model to classify lead intent before escalating to a larger model for personalized messaging.
Lean principle: Choose the best-fit model for each subtask. Optimize prompt structure and token length to reduce waste.

🔗 Tool & Data Integration
➡️ If your agent fetches the same documentation repeatedly, cache results or embed references instead of hitting APIs each time.
Lean principle: Reduce external tool calls through caching, and design retry logic with strict limits and fallbacks to avoid silent loops.

🧪 Testing & Iteration
➡️ A multi-step agent performing web search, summarization, and response generation can silently grow in cost.
Lean principle: Measure more than output accuracy; track retry count, token usage, latency, and API calls to uncover hidden inefficiencies.

🚀 Deployment
➡️ In a production agent, passing the entire conversation history or full documents into the model for every turn increases token usage and latency, often with diminishing returns.
Lean principle: Use summarization, context distillation, or selective memory to trim inputs. Only pass what's essential for the model to reason, respond, or act.

Lean Agentic AI is a design philosophy that brings sustainability, efficiency, and control to agent development by treating cost, carbon, and complexity as first-class concerns.

For more details, visit 👉 https://leanagenticai.com/

#AgenticAI #LeanAI #LangChain #SustainableAI #LLMOps #FinOpsAI #AIEngineering #ModelEfficiency #ToolCaching #CarbonAwareAI
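The tool-caching and retry-limit principles above can be sketched as two small wrappers around any tool call. The `fetch`-style callables and cache size below are illustrative assumptions; the point is simply that memoization and a hard retry cap are a few lines each.

```python
from functools import lru_cache


def make_cached_fetch(fetch):
    """Wrap a tool call so repeated identical requests hit an in-memory cache."""
    @lru_cache(maxsize=256)
    def cached(url: str) -> str:
        return fetch(url)  # only reached on a cache miss
    return cached


def fetch_with_limit(fetch, url, max_retries=2, fallback=""):
    """Retry a flaky tool call a bounded number of times, then fail soft."""
    for _ in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            continue
    return fallback  # strict limit: degrade gracefully, never loop silently
```

`lru_cache` keys on the URL argument, so this only helps when the agent issues identical requests; time-sensitive data would need a TTL cache instead.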
-
Large language models (LLMs) can improve their performance not just by retraining but by continuously evolving their understanding through context, as shown by the Agentic Context Engineering (ACE) framework.

Consider a procurement team using an AI assistant to manage supplier evaluations. Instead of repeatedly inputting the same guidelines or losing specific insights, ACE helps the AI remember and refine past supplier performance metrics, negotiation strategies, and risk factors over time. This evolving "context playbook" allows the AI to provide more accurate supplier recommendations, anticipate potential disruptions, and adapt procurement strategies dynamically.

In supply chain planning, ACE enables the AI to accumulate domain-specific rules about inventory policies, lead times, and demand patterns, improving forecast accuracy and decision-making as new data and insights become available. This approach results in up to 17% higher accuracy on agent tasks and reduces adaptation costs and time by more than 80%. It also supports self-improvement through feedback such as execution outcomes or supply chain KPIs, without requiring labeled data.

By modularizing the process into generating suggestions, reflecting on results, and curating updates, ACE builds robust, scalable AI tools that continuously learn and adapt to complex business environments.

#AI #SupplyChain #Procurement #LLM #ContextEngineering #BusinessIntelligence