The agentic infrastructure and the new backend

We have talked a lot about agentic frameworks: LangGraph, CrewAI, AutoGen, various SDKs, loops, tool calling, memory, planners, critics, supervisors. All useful words, to be fair. But the more I look at the agents people actually use, the more it seems to me that the interesting part has moved below the framework level.

The question is no longer just: which library do I use to make a model think in steps?

The real question is: where does this agent live when it stops being a demo?

Because a serious agent is not a function that calls a model and returns text. It's a small distributed system. It must read context, use tools, execute code, touch files, remember decisions, ask permission, fail well, restart, leave logs, stay within budget and not turn into a bulldozer inside the production repository.

The framework is the steering wheel. The infrastructure is the road, the brakes, the garage, the insurance and the person who knows where the keys are.

## Why there is so much talk about it now

In 2023 and 2024 the conversation was very model-centric. Which LLM? How much context? How much does it cost? How good is it at programming?

In 2025 and 2026 the conversation has shifted. The models are good enough to do real work, and that is exactly why the boring bits become visible: runtime, security, connectors, identity, observability, code execution, deployment, rollback.

It's the natural transition from magic to engineering.

When an agent just needs to generate a response, a chat is enough. When it needs to open a pull request, query a database, call a CRM, start a job, navigate a site, read Slack, compile code and update a document, it needs an operating system around it.

Not in a literal sense. In an organizational sense.

## The first piece: a runtime where the agent can last

An agent often works in steps: look at the state, choose an action, use a tool, observe the result, update the plan, repeat.

If this loop lives inside a single HTTP request, you immediately have a problem. Some actions are slow. Some wait for human input. Some fail and must be retried. Some must survive a deployment or a timeout.

This is where durable workflows, queues, background jobs and state machines come into play. They're not glamorous, but they're the difference between an agent that seems smart in a demo and one you can leave working while you go get coffee.

For me the agentic runtime must answer very concrete questions:

- where do I save the state between one step and the next?
- what happens if the process dies halfway through?
- can I pause and ask for approval?
- can I replay a run to understand why it made that choice?
- can I limit duration, memory, tools and cost?
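
To make these questions concrete, here is a minimal sketch of a loop that checkpoints after every step, pauses for approval and resumes from disk. It is an illustration under simple assumptions, not how any particular workflow engine works: the JSON file, the `approve:` prefix and the names `run_agent` and `approve` are all hypothetical.

```python
import json
from pathlib import Path


def _save(state_path: Path, state: dict) -> None:
    state_path.write_text(json.dumps(state))


def approve(state_path: Path) -> None:
    """Hypothetical human action: mark the pending step as approved."""
    state = json.loads(state_path.read_text())
    state["approved"] = True
    _save(state_path, state)


def run_agent(state_path: Path, plan: list[str], max_steps: int = 10) -> dict:
    """Run plan steps one at a time, checkpointing after each one."""
    # Resume from the checkpoint if an earlier process left one behind.
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"next_step": 0, "results": [], "approved": False, "status": "running"}

    while state["next_step"] < len(plan) and state["next_step"] < max_steps:
        step = plan[state["next_step"]]
        if step.startswith("approve:"):
            if not state["approved"]:
                # Pause: persist the state and hand control back to a human.
                state["status"] = "waiting_for_approval"
                _save(state_path, state)
                return state
            state["approved"] = False  # an approval is valid for one step only
        state["results"].append(f"done {step}")  # stand-in for a real tool call
        state["next_step"] += 1
        state["status"] = "running"
        _save(state_path, state)  # checkpoint, so a crash loses at most one step

    state["status"] = "finished" if state["next_step"] >= len(plan) else "step_limit"
    _save(state_path, state)
    return state
```

Calling `run_agent` again with the same file picks up at `next_step`, which is the property that matters: the run survives the process. A real runtime would also need locking and idempotent steps, which this sketch leaves out.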

Vercel is pushing hard on this front with its AI SDK, functions, workflows and tools for building agents within web applications. But the point is not just Vercel. The point is that the agent needs an operational home, not a single endpoint.

## The second piece: a sandbox, because the agent must be able to get dirty without breaking things

As soon as an agent writes code or executes commands, a sandbox is needed.

It sounds like a technical word, but the idea is domestic: you give it a workbench. It can open files, install dependencies, run tests, do experiments, generate output. If it gets something wrong, the damage is contained. If it works, you promote the result.

An agentic sandbox should have some properties:

- isolated filesystem;
- CPU, memory and time limits;
- controlled network;
- secrets mounted only when needed;
- complete logs;
- the ability to export artifacts;
- clean reset between runs, when necessary.
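
The shape of that workbench can be sketched with the standard library alone. This is deliberately not a real sandbox: a subprocess in a temporary directory gives you a throwaway working directory, a time limit and a scrubbed environment, but no filesystem, network or kernel isolation, which is what containers or microVMs are for. The function name `run_in_workbench` and the convention that artifacts are `*.txt` files are assumptions made for the example.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Only these variables are passed through; everything else (tokens, keys) stays out.
SAFE_ENV_KEYS = ("PATH", "SYSTEMROOT")


def run_in_workbench(code: str, timeout_s: float = 5.0) -> dict:
    """Run a Python snippet in a throwaway directory with a time limit."""
    env = {k: os.environ[k] for k in SAFE_ENV_KEYS if k in os.environ}
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: Python's isolated mode
                cwd=workdir,
                env=env,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "reason": "timeout", "stdout": "", "artifacts": {}}
        # Export artifacts before the directory is deleted (the clean reset).
        artifacts = {p.name: p.read_text() for p in Path(workdir).glob("*.txt")}
        return {
            "ok": proc.returncode == 0,
            "reason": "exit",
            "stdout": proc.stdout,
            "artifacts": artifacts,
        }
```

Every run starts from an empty directory and ends with the directory gone, so the only things that survive are the captured output and the exported files.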

Vercel Sandbox goes exactly in this direction: isolated environments to run code, install dependencies, work with files and produce artifacts without running everything in the main application runtime.

This matters more than it seems. Many agentic prototypes jump directly from the model to the real system. The model can call tools. Tools can do things. It all seems elegant until the first wrong command, the first dependency installed in the wrong place, the first token that ends up in a log.

The sandbox is the adult way of saying: go ahead, but in here.

## The third piece: MCP and the connector problem

The Model Context Protocol has become one of the most interesting parts of the ecosystem because it tries to standardize something that otherwise quickly becomes unmanageable: how a model discovers and uses external tools.

Without a standard, each integration is a small island. A connector for GitHub done one way, one for Slack done another, one for databases with different semantics, one for browser automation that resembles none of the others.

MCP proposes a common language between client and server: tools, resources, prompts, authorization, transport, discovery. It doesn't magically solve governance and security, but it provides a grammar.

And grammar matters. When an agent can connect to many tools, the question is not just "can it do this?". The question is "does it understand what it can do, within what limits, on behalf of whom, and leaving what trace?".

For me, the hype around MCP is not that it "does tool calling". We already did that. It is that it shifts the center of gravity from the single integration to an operational catalog of tools.

In a good agentic architecture, MCP becomes a kind of patch panel:

- GitHub for code and issues;
- Slack for conversational context;
- Linear or Jira for planned work;
- a read-only database for analytics;
- a controlled browser or scraper for external sites;
- document storage;
- isolated execution environments;
- internal systems exposed with strict permissions.

The tricky part is that a tool catalog without policy is just a more elegant way to create chaos.

## The fourth piece: identity and permissions

This is the area where many demos turn a blind eye.

An agent acts on someone's behalf. So it must be clear who the subject of the action is.

Is it using the user's permissions? A service account's? A workspace's? Does it have temporary or permanent access? Can it read everything or just some resources? Can it write? Can it delete? Can it message real people?

If you don't answer these questions well, sooner or later you'll build an assistant with house keys and no memory of who handed them over.

The rule of thumb I like is this: the agent must be able to do less than the human, not more. And when it has to do something riskier, it has to stop and ask.

This means OAuth, scoped tokens, secret management, audit logs, tool policies, allowlists, approval steps. Not very romantic stuff. Necessary stuff.
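
The rule of thumb translates into very little code. In this sketch the agent's effective permissions are the intersection of what the human holds and what the agent was granted, and a short list of risky actions always routes to a human. The scope names, `Grant` and `decide` are made up for the example and do not come from any OAuth provider.

```python
from dataclasses import dataclass

# Hypothetical scope names; a real system would use its provider's scopes.
RISKY_ACTIONS = frozenset({"repo:delete", "message:send"})


@dataclass(frozen=True)
class Grant:
    user: str
    user_scopes: frozenset[str]
    agent_scopes: frozenset[str]

    def effective(self) -> frozenset[str]:
        # The agent never holds a scope that the delegating human lacks.
        return self.user_scopes & self.agent_scopes


def decide(grant: Grant, action: str) -> str:
    """Return 'deny', 'ask' or 'allow' for a single action."""
    if action not in grant.effective():
        return "deny"
    if action in RISKY_ACTIONS:
        return "ask"  # permitted in principle, but a human must confirm
    return "allow"
```

Because the subject of the action is recorded in the grant, the audit trail can always say on whose behalf the agent acted.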

## The fifth piece: memory and context, but without accumulating garbage

Agents need memory, but memory is dangerous when it becomes an attic.

There are at least three types of memory:

- run memory: what happened in this execution;
- project memory: conventions, decisions, constraints;
- personal or team memory: preferences, tone, rituals, processes.

Putting everything in the prompt is the shortcut. It works until it doesn't. Useful memory must be looked after: indexed, updated, expired, verified, made citable.

An agent that remembers badly is worse than an agent that doesn't remember, because it speaks with confidence.

So the infrastructure must include retrieval, instruction files, a knowledge base, embeddings when needed, but also cleanup. We need a culture of memory: what goes in, who approves it, when it decays, how it gets corrected.
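
A memory entry that can be cited, approved, corrected and expired needs only a few fields. The following is a rough in-memory sketch with hypothetical names (`MemoryItem`, `MemoryStore`); it leaves out retrieval and embeddings entirely and focuses on the hygiene part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryItem:
    text: str
    source: str        # where the fact came from, so it can be cited
    approved_by: str   # who let it in
    expires_at: float  # timestamp on whatever clock the caller uses


class MemoryStore:
    def __init__(self) -> None:
        self._items: dict[str, MemoryItem] = {}

    def put(self, key: str, item: MemoryItem) -> None:
        # Writing the same key again is how an entry gets corrected.
        self._items[key] = item

    def recall(self, key: str, now: float) -> Optional[MemoryItem]:
        item = self._items.get(key)
        if item is None or item.expires_at <= now:
            return None  # expired memory is treated as no memory
        return item

    def sweep(self, now: float) -> int:
        """Delete expired items and return how many were removed."""
        expired = [k for k, v in self._items.items() if v.expires_at <= now]
        for key in expired:
            del self._items[key]
        return len(expired)
```

Passing `now` explicitly keeps the store easy to test and makes the decay rule visible at every call site.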

## The sixth piece: observability, eval and replay

If an agent makes a mistake, a log that says "called the model" is not enough.

You want to see the route. What context did it receive? What tools were available? Which tool did it choose? With what arguments? What response did it get? How much did it cost? Where did it get stuck? Did a human approve anything? Is the error in the model, the tool, the prompt, the data or the permissions?

Here agents are more like distributed systems than chatbots.

You need readable traces, not just text logs. You need to be able to replay a run. You need to compare two versions of the same agent on known tasks. You need to measure regressions: not just "it answers better", but "it closes the right ticket without touching files nobody asked it to touch".
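
As an illustration of what a readable trace might contain, the sketch below records one structured span per tool call and serializes the run as JSON lines, from which the sequence of calls can be rebuilt. The field names, `Trace`, `Span` and `replay` are assumptions for the example, not the schema of any tracing product.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Span:
    step: int
    tool: str
    args: dict
    result: str
    cost_cents: int
    approved_by: str = ""  # empty when no human was involved


@dataclass
class Trace:
    run_id: str
    spans: list[Span] = field(default_factory=list)

    def record(self, tool: str, args: dict, result: str,
               cost_cents: int, approved_by: str = "") -> None:
        self.spans.append(
            Span(len(self.spans), tool, args, result, cost_cents, approved_by)
        )

    def total_cost_cents(self) -> int:
        return sum(s.cost_cents for s in self.spans)

    def to_jsonl(self) -> str:
        # One JSON object per line: easy to store, grep and diff.
        return "\n".join(
            json.dumps({"run_id": self.run_id, **asdict(s)}) for s in self.spans
        )


def replay(jsonl: str) -> list[tuple[str, dict]]:
    """Rebuild the ordered list of (tool, args) calls from a stored trace."""
    events = sorted(map(json.loads, jsonl.splitlines()), key=lambda e: e["step"])
    return [(e["tool"], e["args"]) for e in events]
```

Two versions of the same agent can then be compared by diffing the output of `replay` on the same task, which says more than diffing their final answers.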

Agentic evals are harder than text evals because they include actions. It is not enough to compare against an expected string. You have to look at sequences, side effects, quality of the artifact, time, cost, number of human interventions.
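
A small sketch of what scoring actions instead of strings can look like: each check inspects a side effect of the run. The record types and check names here are hypothetical; a real eval would derive `RunResult` from the trace and the repository state.

```python
from dataclasses import dataclass


@dataclass
class Task:
    expected_ticket: str
    allowed_files: set[str]
    max_cost_cents: int


@dataclass
class RunResult:
    closed_tickets: set[str]
    touched_files: set[str]
    human_interventions: int
    cost_cents: int


def evaluate(task: Task, run: RunResult) -> dict[str, bool]:
    """Score one run on what it did, not on the text it produced."""
    return {
        "closed_right_ticket": run.closed_tickets == {task.expected_ticket},
        # Subset check: touching any file outside the allowed set fails.
        "no_unsolicited_files": run.touched_files <= task.allowed_files,
        "within_budget": run.cost_cents <= task.max_cost_cents,
        "no_human_needed": run.human_interventions == 0,
    }
```

Running the same tasks against two agent versions and comparing these dictionaries gives a regression signal that an expected-string comparison cannot provide.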

The funny thing is, we always end up back in the same place: software engineering. Tests, environments, traces, rollbacks. Except that the code now also decides what to do next.

## The seventh piece: human interfaces

The agent doesn't have to live only in a chat.

Some agents need a board. Others need a page with status and logs. Others an "approve" button. Others inline comments. Still others a CLI.

The UI changes behavior. If the only way to control an agent is to write a long message, the user will give it vague instructions. If, however, the user sees the plan, the diff, the sources, the risks and the next action, they can intervene precisely.

A decent agent infrastructure includes control surfaces:

- current status;
- editable plan;
- produced artifacts;
- diff;
- approval requests;
- timeline;
- stop button;
- retry button;
- visible permissions.

It seems trivial, but it isn't. The difference between "creepy AI" and "reliable assistant" is often just that the latter shows you where its hands are.

## The mental stack

If I were to draw it today, the minimum agent stack would be this:

1. Model: reasoning, generation, tool calling, multimodal if necessary.
2. Orchestration: loop, step, planner, policy, human-in-the-loop.
3. Durable runtime: workflow, queue, retry, pause, resume.
4. Sandbox: code execution, isolated file system, limits, artifacts.
5. Tool layer: MCP, internal API, browser, database, repository.
6. Identity layer: OAuth, scope, secret, audit, policy.
7. Memory layer: project context, retrieval, instructions, expiration.
8. Observability: trace, replay, eval, cost and quality metrics.
9. Product surface: chat when it is enough, dashboard when needed, review when it matters.

The agentic framework mainly covers point 2 and part of point 1. The rest is the real work.

## What I would do in practice

If a team told me "we want agents in production", I wouldn't start with ten agents.

I would start with a small, repetitive and observable workflow. For example: open maintenance PRs, update documentation from closed issues, prepare a weekly review, triage duplicate bugs, generate tests for affected files.

Then I would set very clear limits:

- no writes outside a branch or a sandbox;
- no secrets in the prompt;
- tools on an allowlist;
- human approval for external actions;
- mandatory logs and traces;
- a budget per run;
- output always inspectable.
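
Of these limits, the budget per run is the easiest to enforce mechanically. Here is a minimal sketch, assuming costs are tracked in integer cents and every tool call goes through a single `charge` call; `RunBudget` and `BudgetExceeded` are hypothetical names.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run would go over one of its limits."""


class RunBudget:
    def __init__(self, max_cents: int, max_tool_calls: int) -> None:
        self.max_cents = max_cents
        self.max_tool_calls = max_tool_calls
        self.spent_cents = 0
        self.tool_calls = 0

    def charge(self, cents: int) -> None:
        """Account for one tool call, refusing it if it would break a limit."""
        if self.tool_calls + 1 > self.max_tool_calls:
            raise BudgetExceeded("tool call limit reached")
        if self.spent_cents + cents > self.max_cents:
            raise BudgetExceeded("cost limit reached")
        # Only record the call once both checks have passed.
        self.tool_calls += 1
        self.spent_cents += cents
```

The check runs before the spend is recorded, so a refused call leaves the counters untouched and the run can stop cleanly instead of discovering the overrun afterwards.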

Only then would I expand.

Agents don't fail just because the models get it wrong. They fail because we put them in vague environments, with confusing permissions and theatrical expectations.

## My reading

Agentic infrastructure is boring in the best way.

It's not the part that makes you clap in the demo. It's the part that lets you actually use the demo on Monday morning, with real people, real data, and real consequences.

The future of agents will not be decided only by who has the best model. It will be decided by whoever builds the best place for it to work: isolated when it experiments, connected when needed, always observable, authorized with care and humble enough to stop when it doesn't know.

That's where agents stop being a toy and become infrastructure.

## Sources

- [Vercel: How to build AI agents with Vercel and the AI SDK](https://vercel.com/kb/guide/how-to-build-ai-agents-with-vercel-and-the-ai-sdk)
- [Vercel Docs: Sandbox](https://vercel.com/docs/sandbox)
- [Vercel Docs: Working with Sandbox](https://vercel.com/docs/sandbox/working-with-sandbox)
- [Vercel Docs: MCP](https://vercel.com/docs/mcp)
- [Model Context Protocol: Specification](https://modelcontextprotocol.io/specification)
- [OpenAI: New tools for building agents](https://openai.com/index/new-tools-for-building-agents/)
- [Cloudflare Blog: Agents on Cloudflare](https://blog.cloudflare.com/agents-on-cloudflare/)