There’s a thing happening right now in AI that doesn’t have a clean narrative yet, and that’s exactly why it’s worth paying attention to.
For the last two years, the story of AI in the enterprise has been relatively straightforward: take a large language model, connect it to your data, ask it questions or have it generate content. RAG pipelines, chatbots, summarization, classification. Useful stuff. Well-understood patterns. Increasingly commoditized.
The next phase is different, and it’s messier. The industry is calling it “agentic AI”: systems where models don’t just respond to prompts but autonomously plan, execute multi-step tasks, use tools, make decisions, and recover from errors. Instead of “answer this question,” it’s “accomplish this objective.”
We’ve been deep in this space for the last several months, building some of it, studying the rest. Here’s our honest read on where things stand.
What’s real
The core capability is real and it’s not hype. Models can now reliably decompose a complex task into subtasks, decide which tools to use, execute those tools, evaluate the results, and adjust their approach. This isn’t theoretical. It’s in production at companies ranging from startups to the largest tech firms in the world.
Anthropic’s work on tool use and agent architectures has been particularly interesting. Their “Building Effective Agents” research lays out a spectrum from simple prompt chaining to fully autonomous agents, and the key insight is that most production value sits in the middle of that spectrum. Not in the fully autonomous science fiction version, but in structured workflows where models handle decision-making within well-defined boundaries.
That tracks with what we’re seeing. The agentic systems that work in production aren’t the ones trying to be general-purpose autonomous reasoners. They’re the ones with a clear scope, well-defined tool sets, and humans in the loop at the right checkpoints. An agent that can navigate your CI/CD pipeline, identify why a build failed, and either fix the issue or escalate with full context? That’s real and that’s valuable. An agent that “runs your business”? That’s a pitch deck, not a product.
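The bounded-scope pattern can be sketched in a few lines: a fixed tool allowlist, a step budget, and an explicit escalation path. Everything below is illustrative — the tool names, the hard-coded `decide` function standing in for a model call, and the result shape are assumptions, not a real API.

```python
# Sketch of a bounded agent loop for the build-triage example.
# `decide` stands in for a model call; tools are stubbed.

def get_build_log(job_id: str) -> str:
    # Stub: a real tool would query the CI system.
    return f"log for {job_id}: error: missing dependency 'libfoo'"

def retry_build(job_id: str) -> str:
    return f"retried {job_id}"

TOOLS = {"get_build_log": get_build_log, "retry_build": retry_build}

def decide(history: list) -> tuple:
    # Stand-in for the model choosing the next action from context.
    if not history:
        return ("get_build_log", "job-42")
    if "error" in history[-1]:
        return ("escalate", "missing dependency; needs a human")
    return ("done", history[-1])

def run_agent(max_steps: int = 5) -> dict:
    history = []
    for _ in range(max_steps):
        action, arg = decide(history)
        if action == "escalate":
            # Escalate with full context rather than guessing.
            return {"status": "escalated", "context": history, "reason": arg}
        if action == "done":
            return {"status": "done", "result": arg}
        if action not in TOOLS:
            # Refuse anything outside the allowlist.
            return {"status": "escalated", "context": history,
                    "reason": f"unknown tool {action}"}
        history.append(TOOLS[action](arg))
    return {"status": "escalated", "context": history,
            "reason": "step budget exhausted"}
```

The scope boundary lives in three places: the allowlist, the step budget, and the escalation branch that hands the human full context instead of looping.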
What’s promising but early
Multi-agent orchestration, where multiple specialized agents collaborate on a task, is the area we’re watching most closely. The idea is compelling: instead of one model trying to be good at everything, you have a planner agent that decomposes work, specialist agents that handle specific domains, and an orchestration layer that manages handoffs and resolves conflicts.
The research is moving fast. Every major lab is publishing on this. The open-source ecosystem is exploding with agent frameworks: LangGraph, CrewAI, AutoGen, dozens of others. There’s real energy here.
But the gap between demo and production is significant. Multi-agent systems are hard to debug, hard to test, hard to predict. When Agent A passes a task to Agent B and Agent B makes a decision that Agent A didn’t anticipate, diagnosing what went wrong requires reasoning about the interaction between two probabilistic systems. The tooling for this doesn’t really exist yet. People are building it, but it’s early.
The cost model is also unsettled. Agentic workflows are token-hungry by nature. The planning, reasoning, and self-correction loops consume significantly more compute than a single inference call. For high-value tasks where the alternative is hours of human time, the economics work. For high-volume, low-value tasks, they might not. The math is different for every use case and it’s changing as model costs come down, but anyone building agentic systems today needs to be thinking about unit economics, not just capability.
What concerns us
The agent framework ecosystem is moving at a pace that makes us nervous. Not because the frameworks are bad (several are quite good) but because the speed of iteration is encouraging teams to adopt patterns they don’t fully understand.
We’re seeing teams build agent systems on top of abstractions three and four layers deep. The agent uses a framework that wraps the model API that wraps the inference layer. When something goes wrong (and something always goes wrong) the debugging surface is enormous. The team can see that the agent produced a bad output. They cannot easily trace why, because the decision-making happened inside a chain of abstracted function calls across multiple libraries, each with its own versioning and behavior characteristics.
This is a familiar pattern in software engineering. It happened with microservices. It happened with Kubernetes. A powerful new architecture emerges, frameworks proliferate to make it accessible, teams adopt the frameworks before they understand the underlying principles, and then they spend the next two years debugging problems that the abstraction was supposed to prevent.
We’re not saying don’t use frameworks. We’re saying understand what the framework is doing before you trust it with production workloads. Read the source. Run the traces. Know where your prompts go and what happens when the model surprises you.
What we’re doing about it
We’re taking a position that’s probably boring compared to the hype: build the simplest agentic system that solves the problem, instrument the hell out of it, and stay close enough to the primitives that you can debug it when it breaks.
For most use cases today, that means structured workflows with model-driven decision points, not fully autonomous agents. The model decides which branch to take, which tool to call, how to interpret results. But the workflow itself is defined, bounded, and observable. Every decision point logs the reasoning. Every tool call is traceable. Every output can be explained.
When the use case genuinely requires more autonomy (and some do) we add it incrementally, with clear evaluation criteria at each step. Can the agent reliably handle this additional degree of freedom? What’s the failure rate? What’s the blast radius when it fails? The answer to those questions determines whether we expand the agent’s scope or keep the human in the loop for that particular decision.
It’s not the sexiest approach. But the teams running production agent systems that actually work are all doing some version of this. The ones chasing full autonomy on day one are the ones posting retrospectives about why they walked it back.
What to watch
A few things we think will matter in the next twelve months:
Evaluation frameworks for agents. The industry desperately needs better ways to test and evaluate agentic systems. Unit testing a deterministic function is straightforward. Testing a system that makes probabilistic decisions across multi-step workflows is a fundamentally different problem. The teams working on this (and there are several) are doing some of the most important work in the space right now.
Cost trajectories. If inference costs continue to drop (and everything suggests they will) the set of problems where agentic approaches are economically viable expands dramatically. Tasks that don’t make sense at current pricing might make perfect sense in eighteen months. The smart play is to understand the patterns now so you’re ready when the economics tip.
Reliability at scale. The difference between an agent that works 95% of the time and one that works 99.5% of the time is the difference between a demo and a product. Closing that gap for complex, multi-step workflows is the core engineering challenge. The teams that figure it out will have something genuinely defensible.
Regulation and liability. When an agent makes a consequential decision autonomously, who’s accountable? This is a legal and organizational question more than a technical one, and it’s going to shape what kinds of agentic systems are viable in regulated industries. It’s early, but it’s worth watching.
The bottom line
Agentic AI is the most interesting development in applied AI since the transformer. It’s going to create genuine value and it’s going to generate a spectacular amount of waste as companies rush to build agent systems they don’t need, on top of frameworks they don’t understand, for problems that don’t require autonomy.
The opportunity for engineering teams right now is to learn the fundamentals: how tool use works, how planning and reasoning loops work, how to evaluate agentic systems, how to build them incrementally, all without getting caught up in the framework-of-the-week cycle.
The technology is moving fast. The principles are stable. Invest in the principles.
We publish our thinking on AI systems as we develop it. No newsletter, no gated content, just what we’re learning, shared openly. More at steadfastdigital.io/signals.