spinny:~/writing $ cat agentic-infrastructure-stack.md

Agentic infrastructure and the new backend

2026-06-30 · 10 min read · Filippo Spinella · AI, Agents, Infrastructure, Developer Tools

We have often talked about agentic frameworks: LangGraph, CrewAI, AutoGen, various SDKs, loops, tool calling, memory, planners, critics, supervisors. All useful words, no question. But the more I look at the agents people actually use, the more it seems to me that the interesting part has moved below the framework level.

The question is no longer just: which library do I use to make a model think in steps?

The real question is: where does this agent live when it stops being a demo?

Because a serious agent is not a function that calls a model and returns text. It's a small distributed system. It must read context, use tools, execute code, touch files, remember decisions, ask permission, fail well, restart, leave logs, not burn the budget and not turn into a bulldozer inside the production repository.

The framework is the steering wheel. The infrastructure is the road, the brakes, the garage, the insurance and the person who knows where the keys are.

Why there's so much talk about it now

In 2023 and 2024 the conversation was very model-centric. Which LLM? How much context? How much does it cost? How good is it at programming?

In 2025 and 2026 the conversation has shifted. The models are good enough to do real work, and that is exactly why the boring bits become visible: runtime, security, connectors, identity, observability, code execution, deployment, rollback.

It's the natural transition from magic to engineering.

When an agent just needs to generate a response, a chat is enough. When it needs to open a pull request, query a database, call a CRM, start a job, navigate a site, read Slack, compile code and update a document, it needs an operating system around it.

Not in a literal sense. In an organizational sense.

The first piece: a runtime where the agent can last

An agent often works in steps. Look at the state, choose an action, use a tool, observe the result, update the plan, repeat.

If this loop lives inside a single HTTP request, you immediately have a problem. Some actions are slow. Some wait for human input. Some fail and must be retried. Some must survive a deployment or a timeout.

This is where durable workflows, queues, background jobs and state machines come into play. They're not glamorous, but they're the difference between an agent that seems smart in a demo and one you can leave working while you go get coffee.

For me the agentic runtime must answer very concrete questions:

where do I save the state between one step and another?
what happens if the process dies halfway through?
can I pause and ask for approval?
can I replay a run to understand why it made that choice?
can I limit duration, memory, tools and cost?

Vercel is pushing hard on this front with its AI SDK, functions, workflows and tools for building agents within web applications. But the point is not just Vercel. The point is that the agent needs an operational home, not a single endpoint.
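To make the first two questions concrete, here is a minimal sketch of a checkpointed loop. It is not how Vercel or any workflow engine does it: `run_durable`, the JSON file used as a state store and the step names are all assumptions made up for illustration. The only idea it shows is that state is saved after every step, so a run that dies halfway resumes instead of starting over.

```python
import json
import tempfile
from pathlib import Path


def run_durable(steps, state_path: Path):
    """Run named steps in order, checkpointing after each one.

    If a checkpoint already exists, completed steps are skipped.
    """
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"done": [], "data": {}}

    for name, step in steps:
        if name in state["done"]:
            continue  # finished in a previous attempt
        state["data"] = step(state["data"])
        state["done"].append(name)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(state_path)
    return state["data"]


def demo():
    """Simulate a process that dies during the second step, then restarts."""
    calls = []

    def fetch(data):
        calls.append("fetch")
        return {**data, "issue": "typo in README"}

    def plan(data):
        calls.append("plan")
        if calls.count("plan") == 1:
            raise RuntimeError("simulated crash")
        return {**data, "plan": "fix " + data["issue"]}

    steps = [("fetch", fetch), ("plan", plan)]
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "run.json"
        try:
            run_durable(steps, path)
        except RuntimeError:
            pass  # the process "died" halfway through
        result = run_durable(steps, path)  # second attempt resumes
    return result, calls
```

In the demo, `fetch` runs once even though the run is started twice. A real runtime would add pause, approval and limits on top, but the checkpoint is the piece everything else hangs on.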

The second piece: a sandbox, because the agent must be able to get its hands dirty without breaking anything

As soon as an agent writes code or executes commands, it needs a sandbox.

It sounds like a technical word, but the idea is domestic: you give the agent a workbench. It can open files, install dependencies, run tests, do experiments, generate output. If it gets something wrong, you've contained the damage. If it works, you promote the result.

An agentic sandbox should have some properties:

isolated filesystem;
CPU, memory and time limits;
controlled network;
secrets mounted only when needed;
complete logs;
a way to export artifacts;
clean reset between runs, when necessary.

Vercel Sandbox goes exactly in this direction: isolated environments to run code, install dependencies, work with files and produce artifacts without running everything in the main application runtime.
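A toy version of the workbench, to fix the idea. This is not the Vercel Sandbox API and it is not real isolation: a child process in a temp directory has no network control and no filesystem jail, which is what containers or microVMs are for. `run_in_workbench` is a hypothetical name, and the sketch assumes a POSIX-like system. It covers only four of the properties above: a throwaway directory, a time limit, no inherited secrets and exported artifacts.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_workbench(code: str, timeout_s: float = 5.0) -> dict:
    """Run a Python snippet in a throwaway directory with a wall-clock limit."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                cwd=workdir,               # relative paths resolve in the workbench
                env={"PATH": os.defpath},  # parent environment (and secrets) not inherited
                capture_output=True,
                text=True,
                timeout=timeout_s,         # the child is killed when this expires
            )
            status = "ok" if proc.returncode == 0 else "failed"
            stdout, stderr = proc.stdout, proc.stderr
        except subprocess.TimeoutExpired:
            status, stdout, stderr = "timeout", "", ""
        # Export artifacts before the directory is wiped (the clean reset).
        artifacts = {
            p.name: p.read_text()
            for p in Path(workdir).iterdir()
            if p.is_file() and p.name != "task.py"
        }
    return {"status": status, "stdout": stdout, "stderr": stderr, "artifacts": artifacts}
```

The return value is the part I care about: status, output and artifacts come back as data, so promoting a result is an explicit decision rather than a side effect.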

This matters more than it seems. Many agentic prototypes jump directly from the model to the real system. The model can call tools. Tools can do things. It all seems elegant until the first wrong command, the first dependency installed in the wrong place, the first token that ends up in a log.

The sandbox is the adult way of saying: go ahead, but in here.

The third piece: MCP and the connector problem

The Model Context Protocol has become one of the most interesting parts of the ecosystem because it tries to standardize something that otherwise quickly becomes unmanageable: how a model discovers and uses external tools.

Without a standard, each integration is a small island. A connector for GitHub done one way, one for Slack done another, one for databases with different semantics, one for browser automation that looks like nothing else.

MCP proposes a common language between client and server: tools, resources, prompts, authorizations, transport, discovery. It doesn't magically solve governance and security, but it gives a grammar.

And grammar matters. When an agent can connect to many tools, the question is not just "can it do it?". The question is "does it understand what it can do, within what limits, on whose behalf, and leaving what trace?".

For me the hype around MCP is not about "doing tool calling". We already did that. The hype is deserved because MCP shifts the center of gravity from the single integration to the operational catalog of tools.

In a good agentic architecture, MCP becomes a kind of patch panel:

GitHub for code and issues;
Slack for conversational context;
Linear or Jira for planned work;
a read-only database for analytics;
a controlled browser or scraper for external sites;
document storage;
isolated execution environments;
internal systems exposed with strict permissions.

The tricky part is that a tool catalog without policies is just a more elegant way to create chaos.
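Here is roughly what I mean by a catalog with a policy attached. To be clear, this is not the MCP SDK or the MCP wire format: `Tool`, `Catalog`, the `writes` flag and the tool names are hypothetical, invented for this sketch. The point is that discovery and invocation go through the same policy check, and every attempt leaves a trace.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    writes: bool  # True if the tool has side effects
    handler: Callable[..., object]


class Catalog:
    """A tool registry where policy decides what is visible and callable."""

    def __init__(self, allowlist, allow_writes=False):
        self._tools = {}
        self._allowlist = set(allowlist)
        self._allow_writes = allow_writes
        self.audit = []  # (tool name, decision) pairs, in call order

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def _permitted(self, tool: Tool) -> bool:
        if tool.name not in self._allowlist:
            return False
        return self._allow_writes or not tool.writes

    def discover(self):
        """What the agent is told it can do: only permitted tools."""
        return [
            {"name": t.name, "description": t.description}
            for t in self._tools.values()
            if self._permitted(t)
        ]

    def call(self, name: str, **kwargs):
        tool = self._tools.get(name)
        if tool is None or not self._permitted(tool):
            self.audit.append((name, "denied"))
            raise PermissionError(f"tool not permitted: {name}")
        self.audit.append((name, "allowed"))
        return tool.handler(**kwargs)
```

With `allow_writes=False`, a registered `merge_pr` tool never shows up in `discover()` and raises `PermissionError` if the model tries it anyway. The agent only sees what it is allowed to use, which is half of governance already.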

The fourth piece: identity and permissions

This is the area where many demos look the other way.

An agent acts on someone's behalf. So it must be clear who the subject of the action is.

Is it using the user's permissions? A service account's? A workspace's? Is its access temporary or permanent? Can it read everything or just some resources? Can it write? Can it delete? Can it message real people?

If you don't answer these questions well, sooner or later you'll build an assistant with the house keys and no memory of who handed them over.

The rule of thumb I like is this: the agent must be able to do less than the human, not more than the human. And when it has to do something riskier, it has to stop and ask.

This means OAuth, scoped tokens, secret management, audit logs, tool policies, allowlists, approval steps. Not very romantic stuff. Necessary stuff.
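The rule of thumb fits in a few lines. This is a sketch under my own assumptions, not an OAuth implementation: `Grant`, `delegate`, `authorize` and the scope names are hypothetical. A grant is the intersection of what the human holds and what the agent asked for, it expires, and risky scopes come back as "ask" instead of "allow".

```python
import time
from dataclasses import dataclass

# Hypothetical scope names that always require a human in the loop.
RISKY = frozenset({"repo:delete", "message:send"})


@dataclass(frozen=True)
class Grant:
    on_behalf_of: str
    scopes: frozenset
    expires_at: float


def delegate(user, user_scopes, requested, ttl_s, now=None):
    """Issue a grant that can never exceed what the human already holds."""
    now = time.time() if now is None else now
    scopes = frozenset(user_scopes) & frozenset(requested)
    return Grant(on_behalf_of=user, scopes=scopes, expires_at=now + ttl_s)


def authorize(grant, scope, approved=False, now=None):
    """Return a decision string: allow, ask or deny (with the reason)."""
    now = time.time() if now is None else now
    if now >= grant.expires_at:
        return "deny: grant expired"
    if scope not in grant.scopes:
        return "deny: out of scope"
    if scope in RISKY and not approved:
        return "ask: human approval required"
    return "allow"
```

If the agent requests `repo:delete` and the user never had it, the scope simply is not in the grant. If the agent holds `message:send`, the first answer is still "ask". Less than the human, and it stops before the risky part.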

The fifth piece: memory and context, but without accumulating garbage

Agents need memory, but memory is dangerous when it becomes an attic.

There are at least three types of memory:

run memory: what happened in this execution;
project memory: conventions, decisions, constraints;
personal or team memory: preferences, tone, rituals, processes.

Putting everything in the prompt is the shortcut. It works until it doesn't. Useful memory must be curated: indexed, updated, expired, verified, made citable.

An agent that remembers badly is worse than an agent that doesn't remember at all, because it speaks with confidence.

So the infrastructure must include retrieval, instruction files, a knowledge base, embeddings when needed, but also cleanup. We need a culture of memory: what goes in, who approves it, when it decays, how it gets corrected.
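A small sketch of memory that is not an attic. Everything here is an assumption for illustration: `MemoryEntry`, `MemoryStore`, and above all the naive keyword match standing in for real retrieval. What it shows is the lifecycle: entries carry a source, an expiry and a verified flag, recall ignores anything stale or unapproved, and a sweep does the cleanup.

```python
import re
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    text: str
    source: str        # where it came from, so the agent can cite it
    created_at: float
    ttl_s: float       # how long before the entry decays
    verified: bool = False  # has someone approved it?

    def expired(self, now: float) -> bool:
        return now >= self.created_at + self.ttl_s


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


class MemoryStore:
    def __init__(self) -> None:
        self._entries = []

    def add(self, entry: MemoryEntry) -> None:
        self._entries.append(entry)

    def recall(self, query: str, now: float) -> list:
        """Return verified, unexpired entries sharing a word with the query."""
        wanted = _words(query)
        return [
            f"{e.text} [source: {e.source}]"
            for e in self._entries
            if e.verified and not e.expired(now) and wanted & _words(e.text)
        ]

    def sweep(self, now: float) -> int:
        """Drop expired entries and report how many were removed."""
        before = len(self._entries)
        self._entries = [e for e in self._entries if not e.expired(now)]
        return before - len(self._entries)
```

Swap the keyword match for embeddings when you need to. The fields on the entry are the part worth keeping.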

The sixth piece: observability, eval and replay

If an agent makes a mistake, a log line saying "called the model" is not enough.

You want to see the route. What context did it receive? What tools were available? Which tool did it choose? With what arguments? What response did it get? How much did it cost? Where did it get stuck? Did a human approve anything? Is the error in the model, the tool, the prompt, the data or the permissions?

Here agents look more like distributed systems than chatbots.

You need readable traces, not just text logs. You need to be able to replay a run. You need to compare two versions of the same agent on known tasks. You need to measure regressions: not just "it answers better", but "it closes the right ticket without touching unsolicited files".

Agentic evals are more difficult than text evals because they include actions. It is not enough to compare against an expected string. You have to look at sequences, side effects, quality of the artifact, time, cost, number of human interventions.
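The "closes the right ticket without touching unsolicited files" check can be sketched directly. The `Step` record and `evaluate_run` are hypothetical shapes I made up, not a tracing standard, and costs are in integer cents to keep the arithmetic exact. The eval never looks at generated text: it looks at which tools ran, what they touched and what the run cost.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    args: dict
    result: str
    cost_cents: int
    files_touched: list = field(default_factory=list)


def evaluate_run(trace, allowed_files, must_call, max_cost_cents):
    """Score a run on its actions and side effects, not on its wording."""
    touched = {f for step in trace for f in step.files_touched}
    unexpected = sorted(touched - set(allowed_files))
    cost = sum(step.cost_cents for step in trace)
    report = {
        "called_required_tool": any(step.tool == must_call for step in trace),
        "stayed_in_scope": not unexpected,
        "unexpected_files": unexpected,
        "cost_cents": cost,
        "within_budget": cost <= max_cost_cents,
    }
    report["passed"] = (
        report["called_required_tool"]
        and report["stayed_in_scope"]
        and report["within_budget"]
    )
    return report
```

A run that closes the ticket but also edits `src/billing.py` fails this eval even if its final message is perfect. Run the same check on two versions of the agent and you have a regression test.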

The funny thing is, we always end up back in the same place: software engineering. Tests, environments, traces, rollbacks. Except that the code now also decides what to do next.

The seventh piece: human interfaces

The agent doesn't have to live only in a chat.

Some agents need a board. Others need a page with status and logs. Others an "approve" button. Others inline comments. Still others a CLI.

The UI changes behavior. If the only way to control an agent is to write a long message, the user will give the agent vague instructions. If, however, they see the plan, diff, sources, risks and next action, they can intervene precisely.

A decent agent infrastructure includes control surfaces:

current status;
editable plan;
produced artifacts;
diff;
approval requests;
timeline;
stop button;
retry button;
visible permissions.

It seems trivial, but it isn't. The difference between "creepy AI" and "reliable assistant" is often just that the latter shows you where it has its hands.

The mental stack

If I were to draw it today, the minimum agent stack would be this:

Model: reasoning, generation, tool calling, multimodal if necessary.
Orchestration: loop, step, planner, policy, human-in-the-loop.
Durable runtime: workflow, queue, retry, pause, resume.
Sandbox: code execution, isolated filesystem, limits, artifacts.
Tool layer: MCP, internal API, browser, database, repository.
Identity layer: OAuth, scope, secret, audit, policy.
Memory layer: project context, retrieval, instructions, expiration.
Observability: trace, replay, eval, cost and quality metrics.
Product surface: chat when that's enough, dashboard when needed, review when it matters.

The agentic framework mainly covers the orchestration layer and a piece of the model layer. The rest is the real work.

What I would do in practice

If a team told me “we want agents in production,” I wouldn't start with ten agents.

I would start with a small, repetitive and observable workflow. For example: open maintenance PRs, update documentation from closed issues, prepare a weekly review, triage duplicate bugs, generate tests for affected files.

Then I would set very clear limits:

no writes outside a branch or sandbox;
no secrets in the prompt;
tools on an allowlist;
human approval for external actions;
mandatory log and trace;
budget per run;
output always inspectable.

Only then would I expand.
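Of those limits, the budget per run is the easiest to skip and the cheapest to build. A minimal sketch, with hypothetical names (`RunBudget`, `run_with_budget`) and made-up costs in integer cents: every step is charged before it counts, and the run stops cleanly at the first step that would cross a limit.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    pass


@dataclass
class RunBudget:
    max_steps: int
    max_cost_cents: int
    steps: int = 0
    cost_cents: int = 0
    log: list = field(default_factory=list)

    def charge(self, action: str, cost_cents: int) -> None:
        """Record one step, or refuse it if it would exceed a limit."""
        if self.steps + 1 > self.max_steps:
            self.log.append((action, "refused: step limit"))
            raise BudgetExceeded("step limit reached")
        if self.cost_cents + cost_cents > self.max_cost_cents:
            self.log.append((action, "refused: cost limit"))
            raise BudgetExceeded("cost limit reached")
        self.steps += 1
        self.cost_cents += cost_cents
        self.log.append((action, "ok"))


def run_with_budget(actions, budget: RunBudget) -> str:
    """Walk through (action, cost) pairs until done or out of budget."""
    for action, cost_cents in actions:
        try:
            budget.charge(action, cost_cents)
        except BudgetExceeded as exc:
            return f"stopped: {exc}"
    return "completed"
```

The refusal goes into the log too, so when a run stops early you can see which action it was about to take.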

Agents don't fail just because the models get it wrong. They fail because we put them in vague environments, with confusing permissions and theatrical expectations.

My reading

Agentic infrastructure is boring in the best way.

It's not the part that makes you clap in the demo. It's the part that lets you actually use the demo on Monday morning, with real people, real data, and real consequences.

The future of agents will not be decided only by who has the best model. It will be decided by whoever builds the best place for it to work: isolated when it experiments, connected when needed, always observable, sensibly authorized and humble enough to stop when it doesn't know.

That's where agents stop being a toy and become infrastructure.
