Codex and multi-agent workflow: work with agents without losing control
· 7 min read · Filippo Spinella · AI, Developer Tools, Productivity, Software Engineering
The first time a coding agent actually fixes a bug for you, the reaction is almost always the same: a mixture of enthusiasm and suspicion. Nice, sure. But then you look at the diff and ask yourself: "OK, but what exactly did it touch? Can I trust it? Will it do the same thing tomorrow?"
That's where I think the interesting part begins. Not when the agent writes a function, but when it becomes capable enough to take on entire pieces of work: read the repository, create a patch, run tests, open a PR, come back after a review comment. Codex is moving precisely in that direction: background work, separate worktrees, integrated browser, automations, plugins, memory and more explicit permission controls.
The point is not to imagine a future where no one reads code anymore. It would be a terrible future, as well as quite naive. The point is to figure out how to work with agents who can do a lot without letting them do everything.
The change of habit
With traditional autocomplete you were always at the wheel. The AI suggested a line, you decided. With an agent the relationship changes: you give it a goal and it works through multiple steps on its own.
This is powerful, but it shifts the problem. The question is no longer just "can the model program?". The question becomes:
- Did I give it a small enough scope?
- Do I know how to check the result?
- Am I working in an isolated environment?
- Is the final review still human and careful?
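A healthy workflow is less a magic wand than a routine built on those four answers: a small scope, a known check, an isolated environment and a careful human review.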
It sounds less romantic than "the agent builds everything", but it works much better. It's also how good human teams work: clear tasks, quick feedback, explicit accountability.
The good prompt is almost a good ticket
The most dangerous prompt is the vague but confident one: "fix the invoices page", "improve the architecture", "clean up the auth module". These are requests that sound productive and generate huge diffs. But then you find yourself doing archaeology.
A helpful prompt is more boring. For example: implement CSV export for the invoices page, knowing that the table is in app/(dashboard)/invoices/page.tsx, the queries are in src/server/invoices.ts and there is already a similar pattern in app/(dashboard)/reports.
Then add clear constraints: don't change the database schema, don't add dependencies if a small utility is enough, keep the existing UI style. And close with the verification: npm test -- invoices and npm run build.
This kind of brief isn't there to "explain things better to the AI". Above all, it makes clear to you what you are delegating. If you can't write it down concretely, maybe the task isn't ready for an agent yet.
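One way to keep a brief from staying vague is to give it a shape. Here is a minimal TypeScript sketch of the invoices example. The AgentBrief type and the renderBrief function are my own illustration, not something Codex defines or requires.

```typescript
// Hypothetical shape for an agent brief; the field names are illustrative.
interface AgentBrief {
  goal: string;
  context: string[];      // where the relevant code lives
  constraints: string[];  // what the agent must not do
  verification: string[]; // commands that must pass before review
}

// Render the brief as plain text. Throws when the goal or a section is empty,
// on the theory that an incomplete brief is not ready to delegate.
function renderBrief(brief: AgentBrief): string {
  if (brief.goal.trim() === "") {
    throw new Error("Brief has no goal");
  }
  const sections: Array<[string, string[]]> = [
    ["Context", brief.context],
    ["Constraints", brief.constraints],
    ["Verification", brief.verification],
  ];
  const lines: string[] = [`Goal: ${brief.goal}`];
  for (const [title, items] of sections) {
    if (items.length === 0) {
      throw new Error(`Brief is missing section: ${title}`);
    }
    lines.push("", `${title}:`);
    for (const item of items) {
      lines.push(`- ${item}`);
    }
  }
  return lines.join("\n");
}

const invoiceExport: AgentBrief = {
  goal: "Implement CSV export for the invoices page",
  context: [
    "The table is in app/(dashboard)/invoices/page.tsx",
    "The queries are in src/server/invoices.ts",
    "A similar pattern exists in app/(dashboard)/reports",
  ],
  constraints: [
    "Do not change the database schema",
    "Do not add dependencies if a small utility is enough",
    "Keep the existing UI style",
  ],
  verification: ["npm test -- invoices", "npm run build"],
};

console.log(renderBrief(invoiceExport));
```

The empty-section check is the useful part: if you can't fill in the constraints or the verification commands, the function fails before the agent ever sees the task.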
Three jobs that I willingly delegate
The first is repetitive but verifiable work: adding tests, migrating calls to a new internal API, updating imports, replacing deprecated components, fixing TypeScript errors. Here the agent can save hours and the risk is controllable.
The second is exploratory work: "find where this total is calculated", "explain to me why this test is fragile", "reproduce the bug and tell me which files seem to be affected". Even when it doesn't produce a patch right away, it can do useful reconnaissance.
The third is recurring maintenance work: small dependency updates, cleanup of old feature flags, summaries of blocked PRs, checks for forgotten TODOs. It's not glamorous, but it's exactly the kind of work that tends to pile up.
Three jobs that I keep human
Product decisions remain human. If a change affects how a user pays, deletes data, sees prices, or understands a permission, I want a person accountable for it.
Security boundaries also deserve human attention: auth, roles, tokens, logging of sensitive data, database migrations. An agent can help with the implementation, but it must not be the sole decision maker.
Finally, I keep everything that requires architectural taste human. An agent can propose a refactor, but judging whether an abstraction is really necessary, or whether we are just polishing a non-existent problem, remains a human job.
The review is not optional
The temptation, when an agent is good, is to trust a green CI run. It's understandable. It's also when the problems start.
I always look at at least five things:
- Does the patch only solve the requested task?
- Did it touch files that had nothing to do with it?
- Do the tests cover the new behavior or just the happy path?
- Does the code follow local patterns?
- Are errors handled as in the rest of the project?
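The second question on that list is easy to automate as a first pass. Here is a small sketch, assuming you have the list of changed paths (for example from `git diff --name-only`) and the paths the brief named. The function name and the sample data are hypothetical.

```typescript
// Return the changed files that fall outside the paths named in the brief.
// A path is in scope when it starts with one of the allowed prefixes.
function findOutOfScopeFiles(changedFiles: string[], allowedPrefixes: string[]): string[] {
  return changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix)),
  );
}

// Illustrative data: what the agent changed vs. what the task was about.
const exampleChanged = [
  "app/(dashboard)/invoices/page.tsx",
  "src/server/invoices.ts",
  "src/lib/money.ts",
  "package.json",
];
const exampleScope = ["app/(dashboard)/invoices/", "src/server/invoices.ts"];

// Flags src/lib/money.ts and package.json for a closer look.
console.log(findOutOfScopeFiles(exampleChanged, exampleScope));
```

A flagged file is not necessarily a mistake. It is just where I would start reading the diff.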
When something is wrong, feedback needs to be specific. “Fix it” is lazy. Better: this utility duplicates parseMoney in src/lib/money.ts; reuse that function, add a test for the EUR case and don't change the public API of the billing module.
Agents respond much better to small, verifiable comments. Curiously, so do people.
Guardrails worth the effort
If an agent can read files, write code, and execute commands, it should be treated as a powerful process. You don't need paranoia; you need hygiene.
Use separate worktrees or branches. That way you can compare the diff, throw away failed experiments, and keep the agent's work apart from what you were doing.
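With plain git, a separate worktree for each task is one command: `git worktree add -b <branch> <path>` creates a new branch and checks it out in its own directory. The sketch below only builds the arguments. The "agent/" branch prefix and the "../worktrees" location are naming choices I made up for the example.

```typescript
// Build the arguments for `git worktree add -b <branch> <path>`.
// The branch prefix and directory layout are illustrative conventions.
function worktreeArgs(taskName: string): string[] {
  const slug = taskName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything unusual into dashes
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
  if (slug === "") {
    throw new Error("Task name has no usable characters");
  }
  return ["worktree", "add", "-b", `agent/${slug}`, `../worktrees/${slug}`];
}

console.log(["git", ...worktreeArgs("CSV export for invoices")].join(" "));
// prints: git worktree add -b agent/csv-export-for-invoices ../worktrees/csv-export-for-invoices
```

When the experiment fails, `git worktree remove` on that path and deleting the branch leave your own checkout untouched.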
Limit permissions. Commands like rg, git diff, npm test and npm run build can be quite free. Deployments, database migrations, access to secrets and destructive commands must remain explicit.
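To make the split concrete, here is a minimal sketch of an allowlist in TypeScript. It is not how Codex implements its permission controls. The names are hypothetical and the list simply mirrors the commands mentioned above. A prefix check like this is a first filter, not a sandbox.

```typescript
type Decision = "allow" | "ask";

// Hypothetical allowlist: commands starting with these words run without asking.
const FREE_COMMANDS: string[][] = [
  ["rg"],
  ["git", "diff"],
  ["npm", "test"],
  ["npm", "run", "build"],
];

// Anything that chains, pipes, redirects or substitutes goes to a human:
// a prefix match cannot judge "npm test && rm -rf .".
const SHELL_METACHARACTERS = /[;&|<>`$()\n]/;

function classifyCommand(command: string): Decision {
  if (SHELL_METACHARACTERS.test(command)) {
    return "ask";
  }
  const tokens = command.trim().split(/\s+/);
  const isFree = FREE_COMMANDS.some((prefix) =>
    prefix.every((word, index) => tokens[index] === word),
  );
  // Default to asking: deployments, migrations and the rest stay explicit.
  return isFree ? "allow" : "ask";
}

for (const cmd of ["npm test -- invoices", "git diff main", "npm run deploy", "npm test && rm -rf ."]) {
  console.log(`${classifyCommand(cmd)}: ${cmd}`);
}
```

The important design choice is the default: whatever is not explicitly free gets a confirmation step.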
Reduce network access when you don't need it. For many tasks, official documentation, package registry and specific internal services are sufficient. Less surface area, fewer surprises.
Track actions. When a patch arrives in review, you should be able to reconstruct prompts, commands executed, tests passed and files modified. Not to create bureaucracy, but to be able to understand what happened if something goes wrong.
An easy way to get started as a team
If I were to introduce agents into a small team, I would start without major revolutions.
I would create an agent-ready label for issues with clear scope. I would add a template with context, constraints and verification commands. I would ask for small PRs, ideally under a few hundred lines. I would require tests or screenshots for visible changes. And above all I would keep a person responsible for the merge.
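The size rule is the easiest one to check mechanically. Here is a sketch that sums the output of `git diff --numstat`, which prints one line per file as added, deleted and path separated by tabs, with "-" in place of the counts for binary files. The 300-line threshold is my own assumption for "a few hundred lines".

```typescript
// Sum added and deleted lines from `git diff --numstat` output.
function countChangedLines(numstat: string): number {
  let total = 0;
  for (const line of numstat.split("\n")) {
    const [added, deleted] = line.split("\t");
    if (added === undefined || deleted === undefined) continue; // blank or malformed line
    const addedCount = Number.parseInt(added, 10);
    const deletedCount = Number.parseInt(deleted, 10);
    if (Number.isNaN(addedCount) || Number.isNaN(deletedCount)) continue; // binary file
    total += addedCount + deletedCount;
  }
  return total;
}

const MAX_CHANGED_LINES = 300; // assumption: one reading of "a few hundred lines"

function isSmallEnough(numstat: string): boolean {
  return countChangedLines(numstat) <= MAX_CHANGED_LINES;
}

// Illustrative input, not taken from a real repository.
const sampleNumstat = [
  "120\t30\tsrc/server/invoices.ts",
  "-\t-\tpublic/logo.png",
  "45\t0\tsrc/lib/csv.ts",
  "",
].join("\n");

console.log(countChangedLines(sampleNumstat), isSmallEnough(sampleNumstat)); // 195 true
```

A check like this only tells you whether the PR needs splitting. It says nothing about whether the change is good.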
After two weeks I would look at the data: which tasks were really sped up, which reviews were heavy, which prompts were confusing, which parts of the codebase are too fragile to delegate.
It's a less spectacular approach than "from today we'll do everything with the agents", but it's the one that allows you to get to the third week without regrets.
The most human part
The funny thing is that the more autonomous agents become, the more the classic skills matter again: writing a good ticket, splitting work into small pieces, writing tests, reading diffs, communicating trade-offs. The agent accelerates those who already know how to work well. It also amplifies the chaos of those who delegate badly.
So no, I don't see multi-agent workflows as a shortcut to stop doing engineering. I see them as a way to shift more energy to the parts that matter: deciding what to build, making sure it works, keeping the system understandable.
Agents can make great asynchronous colleagues. But an asynchronous colleague, to be useful, needs context, boundaries and review. Just like everyone else.