New tools for building agents
Let’s cut the marketing noise. If you read the press releases, you’d think AI agents are fully autonomous digital employees ready to replace your entire engineering department by next Tuesday. The reality on the ground is far less glamorous. Most "agents" running in production today are fragile `while(true)` loops wrapped around an LLM API, held together by prompt engineering and prayer.
But the tooling is finally maturing. We are moving past the chaotic Python scripts of 2023 and the over-engineered monoliths of 2024. The industry has realized that sending unstructured text to a stochastic model and hoping for JSON back is not a scalable architecture.
In 2026, we have actual primitives.
This guide breaks down the tools that actually work, the frameworks that are overhyped, and the infrastructure you need to stop your agents from stuck in infinite loops trying to `git push` to a deleted repository.
## The Shift to Deterministic Orchestration
For the last two years, developers tried to let the LLM dictate the control flow. You would give an agent a list of tools, write a prompt saying "think step by step," and let it figure out what to do. This is the ReAct (Reasoning and Acting) pattern.
It works wonderfully in demos. It fails spectacularly in production.
Models hallucinate tool names. They pass strings instead of integers. They get stuck in loops where they fetch a webpage, fail to parse the DOM, and decide to fetch the exact same webpage fifty more times until your API credits evaporate.
The industry solution is to rip control flow away from the LLM.
### Enter LangGraph
LangChain started as a wrapper for API calls. It was bloated and abstracted the wrong things. But **LangGraph** is a different beast entirely. It treats agent workflows as cyclic graphs. Instead of letting the LLM decide what happens next, you define a strict state machine.
With LangGraph, nodes are functions (or LLM calls), and edges are conditional logic written in actual code, not prompts. The LLM acts as a reasoning engine at specific decision points, but the guardrails are hardcoded.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
import operator
# Define your state explicitly
class AgentState(TypedDict):
input_query: str
intermediate_steps: Annotated[list, operator.add]
final_output: str
error_count: int
def llm_reasoning_node(state: AgentState):
# Call the model here
if state["error_count"] > 3:
return {"final_output": "Task failed after 3 retries."}
return {"intermediate_steps": ["Called database"]}
def tool_execution_node(state: AgentState):
# Execute actual Python code
return {"error_count": 0}
workflow = StateGraph(AgentState)
workflow.add_node("reason", llm_reasoning_node)
workflow.add_node("act", tool_execution_node)
# Hardcoded conditional logic, not prompt-driven
workflow.add_conditional_edges(
"reason",
lambda x: "act" if not x.get("final_output") else END
)
```
This graph-based approach means you can persist the state at any node, pause execution for a human to approve a destructive action, and resume it hours later. It allows agents to act as reliable connectors between systems, interpreting outputs from one platform and triggering actions in another without months of bespoke API development.
### Microsoft's Agent Framework
If LangGraph is for the Python hackers, **Microsoft’s Agent Framework** is for the Enterprise C# crowd. It provides boilerplate that naturally hooks into Active Directory, Azure OpenAI, and existing internal systems.
It is verbose. It requires interfaces for everything. But when you are building an agent that has permission to touch payroll data, you want verbosity. You want strict type checking and dependency injection. Microsoft's play here isn't to be the most elegant framework; it's to be the most compliant. It gives enterprise architects the warm, fuzzy feeling they need to sign off on deployments.
### CrewAI: Roleplay for APIs
**CrewAI** takes a different approach. It focuses on multi-agent architectures where you define specific "personas" for your agents and let them debate each other.
You define a "Senior Data Analyst" agent and a "Quality Assurance" agent. The analyst writes the SQL, and the QA agent reviews it.
```yaml
# Typical CrewAI configuration
agents:
analyst:
role: "Senior Data Analyst"
goal: "Extract user retention metrics from the Postgres cluster."
backstory: "You are a cynical veteran DBA who hates ORMs."
reviewer:
role: "Database Administrator"
goal: "Ensure queries do not drop tables or perform full table scans."
```
Is this useful? Sometimes. Assigning distinct roles helps prevent a single context window from becoming overly confused. By forcing the model to evaluate the output from a different "perspective," you can catch logic errors. However, treating LLMs like a simulation of a corporate office is computationally expensive and heavily dependent on prompt adherence. CrewAI is excellent for fast prototyping, but you will likely strip it down to pure state machines once you hit scale.
## The Protocol That Changes Everything: MCP
We need to talk about the actual breakthrough in agent architecture this year: the **Model Context Protocol (MCP)**.
Until recently, every time you built an agent, you had to write custom glue code to teach the LLM how to interact with your specific API. You wrote JSON schemas, you handled HTTP errors, you parsed the markdown outputs.
MCP standardizes this. It provides a universal socket for AI agents.
Think of it like the Language Server Protocol (LSP) for code editors. Before LSP, every editor had to write custom syntax highlighting for every language. Now, the language provides a server, and the editor provides a client.
MCP does this for AI tools. You run an MCP Server that exposes your internal tools (like a web browser, a GitHub repo, or a proprietary database), and any MCP-compliant agent can immediately discover and use those tools.
### Building an MCP Tool
Here is what an MCP tool definition looks like in practice. Notice how cleanly it maps to standard JSON Schema:
```json
{
"name": "browser_navigate",
"description": "Navigate a headless browser to a specific URL and extract the readable DOM.",
"parameters": {
"type": "object",
"properties": {
"url": {
"type": "string",
"format": "uri",
"description": "The fully qualified HTTP/HTTPS URL"
},
"wait_for_selector": {
"type": "string",
"description": "Optional CSS selector to wait for before extracting text"
}
},
"required": ["url"]
}
}
```
When your agent connects to an MCP server, it requests a list of available tools. The server responds with these definitions. The agent’s LLM evaluates the user’s prompt, realizes it needs to search the web, formats a standard JSON-RPC request targeting `browser_navigate`, and fires it over the MCP transport layer (usually stdio or HTTP/SSE).
```javascript
// A simple Node.js MCP Client executing a tool call
const client = new McpClient({ transport: new StdioTransport() });
await client.connect();
const tools = await client.listTools();
console.log(`Discovered tools: ${tools.map(t => t.name)}`);
// The LLM decides to call a tool based on the schema
const result = await client.callTool("browser_navigate", {
url: "https://news.ycombinator.com"
});
console.log(result.content);
```
This protocol decoupling is massive. It means your infrastructure team can build secure, audited MCP servers that expose internal databases, and your AI engineers can swap out the underlying reasoning models (OpenAI, Anthropic, local Llama) without rewriting a single line of integration code.
## OpenAI's Native Building Blocks
While the open-source community rallies around MCP and LangGraph, OpenAI is making a distinct play for the entire stack. They recently released a suite of native building blocks designed to keep developers firmly inside their ecosystem.
OpenAI's goal is to transition from an inference provider to the operating system of the agentic web. They view agents as persistent systems that independently accomplish tasks on behalf of users. Over the past year, they've introduced advanced model capabilities tailored specifically for this: native function calling, structured outputs, and the Assistants API.
Their new primitives allow you to spin up a stateful thread, attach a vector store, and register tools directly against their API.
### The Lock-in Tradeoff
Using OpenAI's native tools is incredibly fast. You don't have to manage state databases, you don't have to chunk and embed documents for RAG, and you don't have to write complex retry logic.
```bash
# Registering an assistant via OpenAI CLI/API
curl https://api.openai.com/v1/assistants \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"instructions": "You are a deployment agent. Only use the provided github_deploy tool.",
"name": "DeployBot",
"tools": [{"type": "function", "function": {"name": "github_deploy"}}],
"model": "gpt-4o"
}'
```
But the tradeoff is total vendor lock-in. Your agent's memory, state, and tool definitions are hosted by OpenAI. If you want to switch to a cheaper model for simple tasks, or if you need to run an open-weights model locally for data privacy, you have to rewrite your entire orchestration layer.
For quick internal prototypes, OpenAI's building blocks are phenomenal. For mission-critical enterprise systems, handing over the entire execution state to a third-party API is an architectural risk that many engineering leads refuse to take.
## Observability: Staring into the Black Box
The dirty secret of building AI agents is that they fail silently in bizarre ways. An agent might successfully complete a task, but only after wasting three dollars on useless recursive API calls. Standard application performance monitoring (APM) tools like Datadog or New Relic are entirely unequipped for this. They show you API latency and HTTP 500s; they do not show you *why* an LLM decided to delete a database row instead of updating it.
### Visualizing the Execution Graph
This is where specialized observability platforms like **Vellum** come into play. Vellum makes the entire iterative process visible through a visual workflow canvas.
When an agent executes, Vellum logs the exact prompt sent to the LLM, the raw JSON response, the specific tool called, the output of that tool, and the subsequent prompt generated by the agent to evaluate its own work.
You can see every tool call and every reasoning step.
If your agent fails in production, you can open the trace, see exactly which intermediate step confused the model, adjust the system prompt or the tool schema, and replay the exact execution path against the new logic to verify the fix.
Without this level of observability, debugging an agent is like debugging a microservice architecture by reading kernel logs. It is technically possible, but it will make you hate your life.
## The Executive Math
If all this sounds complicated, brittle, and expensive, you might wonder why we are building these things at all.
The answer is simple: the financial upside is too massive to ignore.
The suits have seen the math. According to PwC, roughly 88% of executives say their companies plan to increase their AI-related budgets this year. The primary driver isn't chat interfaces; it is agentic AI.
The early adopters are seeing actual returns. Of those actively deploying agentic systems, 66% report measurable increases in productivity, and 57% report direct cost savings.
IBM's research paints an even starker picture of the timeline. Globally, executives expect the proportion of enterprise workflows enabled by AI to grow from a marginal 3% today to 25% by the end of 2025. Furthermore, 70% of these decision-makers view agentic AI as fundamental to their organization’s future survival.
This isn't just hype; it is a forced march toward automation. The budgets are unlocked. The mandate is clear. Your job is to make sure the implementation doesn't bring down production.
## Framework Comparison Matrix
To make sense of the noise, here is how the primary tools stack up against each other in 2026:
| Tool / Framework | Mental Model | Best Use Case | Primary Drawback | Lock-in Level |
| :--- | :--- | :--- | :--- | :--- |
| **LangGraph** | Cyclic State Machine | Production systems requiring strict control flow and human-in-the-loop logic. | Steep learning curve, verbose graph definitions. | Low (Model agnostic) |
| **CrewAI** | Multi-Agent Roleplay | Rapid prototyping, complex research tasks requiring multiple "perspectives". | Non-deterministic, hard to constrain in strict CI/CD pipes. | Low (Model agnostic) |
| **Microsoft Agent Framework** | Enterprise Boilerplate | Deep integration with Azure, .NET stacks, and legacy Microsoft enterprise tools. | Heavyweight, overkill for simple autonomous tasks. | Medium (Azure ecosystem bias) |
| **OpenAI Assistants API** | Hosted Managed State | Building high-quality agents quickly without managing your own database or RAG pipeline. | You surrender control of the orchestration state. | High (Tied to OpenAI) |
| **MCP (Protocol)** | Universal Sockets | Standardizing internal APIs so *any* agent framework can consume them safely. | Requires rewriting existing bespoke API wrappers. | None (Open standard) |
## Hard Lessons from Production
Before you `pip install` or `npm install` any of the frameworks above, memorize these architectural realities:
### 1. Agents Cannot Write Their Own Tools
Do not give an agent `eval()` or unrestricted shell access and tell it to figure out how to query your database. You will create a security disaster, and the agent will waste tokens writing boilerplate. Write strict, strongly typed tools (via MCP) and let the agent call them. The agent should provide the parameters; you provide the execution environment.
### 2. State Must Be Persisted Externally
If an agent is running a workflow that takes 45 minutes, the process will inevitably die, network requests will timeout, or the LLM provider will throw a 529 Overloaded error. Your orchestration framework must support pausing and resuming from an external database (Postgres, Redis). This is why LangGraph is currently winning the production wars.
### 3. Human-in-the-Loop is Mandatory for State Changes
Read-only agents can run autonomously. Agents that mutate state (update databases, send emails, trigger deployments) must hit an approval checkpoint. Build your graphs so execution suspends, sends a Slack ping with the proposed JSON payload, and waits for a human to click "Approve" before resuming.
## Actionable Takeaways
You want to build an agent today. Stop reading think-pieces and follow this playbook:
1. **Avoid the raw ReAct loop:** Do not write your own while-loop parser for LLM outputs. You will spend six months reinventing a broken wheel.
2. **Standardize on MCP:** Expose your company's internal data and APIs via the Model Context Protocol. Separate the tools from the intelligence.
3. **Start with LangGraph:** Map your business logic out on a whiteboard. Define exactly what the state object looks like. Code the edges of the graph before you write a single LLM prompt.
4. **Demand visual observability:** Instrument your code with Vellum or an equivalent tracing platform from day one. If you cannot see the exact JSON the LLM generated before it crashed, you are flying blind.
5. **Ignore the multi-agent hype:** You probably do not need five different LLM personas debating each other. You need one reliable LLM routing a user request to a well-typed Python function. Keep it boring.
The era of impressive AI demos is over. We are now in the era of reliable systems engineering. Act accordingly.