We have been running autonomous agents in production for about a year now, across our internal tooling and Alumia. This post is a collection of the things that were not obvious at the start and that we think will be useful to other teams doing the same kind of work.
The first and biggest lesson. The reliability of an agent is almost entirely a function of how closely you can observe what it is doing. Not the model. Not the prompt. Not the tool choice. The ability to see, in detail, every decision the agent made, why it made it, what output came back, and what the agent did next. When we first started, our observability was mediocre and our agents felt flaky. We rewrote the observability layer from scratch early in the year; the agents themselves barely changed; reliability from the user's perspective went up by a factor that felt enormous.
What does "good observability for agents" mean, concretely? Every step logged with its inputs, outputs, latency, and model used. A visual trace of the full agent run, navigable like a debugger. The ability to replay a run from any step forward. The ability to diff two runs — same starting state, different outcomes, why. Automatic redaction of sensitive data. Enough retention to go back weeks. This sounds straightforward. Almost every team we have talked to is missing half of it.
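To make the first item concrete, here is a minimal sketch of a per-step trace record, assuming a JSON-lines trace store. Every name in it (`StepTrace`, `traced_step`, `SENSITIVE_KEYS`) is illustrative, not an API from any particular framework:

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field

SENSITIVE_KEYS = {"api_key", "token", "email"}  # extend to match your data

def redact(payload: dict) -> dict:
    """Scrub obviously sensitive fields before a trace leaves the process."""
    return {k: "<redacted>" if k in SENSITIVE_KEYS else v for k, v in payload.items()}

@dataclass
class StepTrace:
    """One record per agent step: enough to replay, diff, and debug a run."""
    run_id: str
    step: int
    tool: str
    model: str
    inputs: dict
    outputs: dict = field(default_factory=dict)
    latency_ms: float = 0.0

@contextmanager
def traced_step(run_id: str, step: int, tool: str, model: str, inputs: dict):
    trace = StepTrace(run_id, step, tool, model, redact(inputs))
    start = time.monotonic()
    try:
        yield trace  # the caller fills in trace.outputs
    finally:
        trace.latency_ms = (time.monotonic() - start) * 1000
        trace.outputs = redact(trace.outputs)
        print(json.dumps(asdict(trace)))  # in practice: ship to your trace store
```

One structured record per step is what makes replay and diffing cheap: two runs become two sequences of records you can compare field by field.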
The second lesson. Failure handling is the product. The success case is easy. Any agent can complete a task when everything goes right. The user experience is determined by what happens when things go wrong — when the API returns a 500, when the model hallucinates a tool argument, when the task turns out to be ambiguous, when the environment the agent is operating in has changed since it last looked. Most agent frameworks we have used treat these cases as edge cases. In production, they are not edge cases. They are 20 to 40 percent of all runs, depending on the workload. If the experience is bad in those cases, the product is bad.
Specific failure handling patterns that have worked. Deterministic retry policies for transient failures, not model-decided retries. Clear escalation thresholds — after N retries on the same step, stop the agent and surface the issue to the user. User-visible recovery actions — a one-click rollback if the agent did something unintended, an easy "pick up from here" button to continue from partial progress. Explicit uncertainty handling — if the agent's confidence in its next step falls below a threshold, pause and ask the user rather than guess.
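A sketch of the retry-and-escalation shape, assuming the retry decision lives in the harness rather than the model; the exception names, the `propose` callable, and the 0.7 floor are all placeholders, not anyone's real API:

```python
import time

class TransientError(Exception):
    """A failure worth retrying: 5xx, timeout, rate limit."""

class EscalateToUser(Exception):
    """Stop the agent and surface the problem, with partial progress intact."""

def run_with_retries(step_fn, *, max_retries: int = 3, base_delay: float = 1.0):
    """Deterministic retry policy: fixed attempt count, exponential backoff.
    The harness decides whether to retry; the model never does."""
    for attempt in range(max_retries + 1):
        try:
            return step_fn()
        except TransientError as err:
            if attempt == max_retries:
                raise EscalateToUser(f"step failed after {max_retries} retries: {err}")
            time.sleep(base_delay * 2 ** attempt)

def choose_next_action(propose, state, confidence_floor: float = 0.7):
    """Explicit uncertainty handling: below the floor, ask instead of guessing."""
    action, confidence = propose(state)  # hypothetical (action, confidence) API
    if confidence < confidence_floor:
        raise EscalateToUser(f"confidence {confidence:.2f} below {confidence_floor}")
    return action
```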
The third lesson. Costs creep up on you. At small scale, agent runs feel free. Each run is a few cents. At large scale, with long runs and expensive models, the cost compounds quickly. We have seen individual agent runs, unchecked, consume tens of dollars in model calls. That is not necessarily wrong — some workloads are worth it — but if you do not track it, you will not know until the bill arrives. We added per-run cost tracking, per-user cost limits, and per-workload budgets as part of the same observability pass. It turned out to be one of the highest-value additions, not just for finance but for product — the runs that were most expensive were often the ones that were going wrong.
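A sketch of the per-run piece, assuming you can see token counts per model call. The model names and prices below are placeholders, and per-user and per-workload limits would hang off the same hook:

```python
from collections import defaultdict

# Placeholder prices per 1K tokens; use your provider's actual price sheet.
PRICE_PER_1K_USD = {"big-model": 0.03, "small-model": 0.002}

class BudgetExceeded(Exception):
    pass

class CostTracker:
    """Accumulates model spend per run and enforces a hard per-run budget."""
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd: dict[str, float] = defaultdict(float)

    def charge(self, run_id: str, model: str, tokens: int) -> None:
        self.spent_usd[run_id] += PRICE_PER_1K_USD[model] * tokens / 1000
        if self.spent_usd[run_id] > self.budget_usd:
            # The most expensive runs are often the broken ones; stop them early.
            raise BudgetExceeded(f"run {run_id} passed ${self.budget_usd:.2f}")
```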
The fourth lesson. Concurrency is hard. Agents do I/O. Most of their time is waiting for something — an API, a model, a tool. When you scale up, you want to run many agents in parallel. But agents sometimes need state that other agents are writing to. Sometimes they hit rate limits on shared APIs. Sometimes they create race conditions nobody thought about. We learned, the hard way, to assume from the start that agents will run concurrently, and to build all shared state with that assumption. The expensive refactors were the ones where we did not.
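The two assumptions worth building in from day one are that shared state is mutated concurrently and that shared APIs are rate-limited. A minimal asyncio sketch, with illustrative names:

```python
import asyncio

class SharedRateLimiter:
    """Caps concurrent calls to a shared API across every agent in the process."""
    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def call(self, fn, *args, **kwargs):
        async with self._sem:
            return await fn(*args, **kwargs)

class SharedState:
    """Every read-modify-write goes through one lock, so two agents
    updating the same key cannot interleave and lose a write."""
    def __init__(self):
        self._lock = asyncio.Lock()
        self._data: dict = {}

    async def update(self, key, fn):
        async with self._lock:
            self._data[key] = fn(self._data.get(key))
            return self._data[key]
```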
The fifth lesson. Context management is where most teams fail silently. An agent running on a long task accumulates a lot of context — things it has seen, things it has done, things it was told. Most teams we have talked to use a naive approach: append everything to the prompt, truncate when you hit the limit. That works fine for short tasks and breaks in subtle ways for long ones. We built a layered memory system, which is now one of our open-source packages. The short version: working memory for the current step, episodic memory for recent steps, semantic memory for things the agent has learned that persist across runs. The cost of building this is non-trivial. The cost of not building it shows up as quality degradation on long runs, which is very hard to diagnose.
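This is not that package, just a sketch of the layering, under the assumption that each finished step produces a short summary:

```python
from dataclasses import dataclass, field

@dataclass
class LayeredMemory:
    working: dict = field(default_factory=dict)   # the current step only
    episodic: list = field(default_factory=list)  # recent steps, bounded
    semantic: dict = field(default_factory=dict)  # persists across runs

    EPISODIC_LIMIT = 50  # beyond this, old steps should be summarized away

    def end_step(self, summary: str) -> None:
        """Fold the finished step into episodic memory; reset working memory."""
        self.episodic.append(summary)
        if len(self.episodic) > self.EPISODIC_LIMIT:
            self.episodic.pop(0)  # a real system would compress, not drop
        self.working.clear()

    def learn(self, key: str, fact: str) -> None:
        """Facts worth keeping across runs go to the semantic layer."""
        self.semantic[key] = fact

    def build_context(self) -> str:
        """Assemble prompt context: durable facts first, then recent history."""
        return "\n".join([*self.semantic.values(), *self.episodic[-10:]])
```

The point of the layering is that truncation becomes an explicit policy decision per layer instead of a silent side effect of the context window.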
The sixth lesson. Evals matter more than you think. We spent the first half of the year relying on user feedback to detect regressions. That works when you have ten users and is far too slow when you have a thousand. Automated evaluations — a representative set of scenarios run on every change — catch regressions earlier and let you ship with more confidence. Building a good eval suite is itself a project; our current one has a few hundred scenarios and grows as we find new ways for agents to fail. This is another area where we ended up building an open-source library because the internal tooling was useful enough to share.
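A sketch of the scenario shape, under the assumption that a run can be scored by a predicate on its output. Real suites also need fuzzier graders, but the harness looks roughly like this:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    task: str
    passed: Callable[[str], bool]  # did the run's output satisfy the scenario?

def run_suite(agent_fn: Callable[[str], str], scenarios: list[Scenario]) -> list[str]:
    """Run every scenario on every change; return the names that regressed."""
    return [s.name for s in scenarios if not s.passed(agent_fn(s.task))]

# One failure mode per scenario, e.g. the agent should ask rather than guess
# when a task is ambiguous (the check here is deliberately crude).
SUITE = [
    Scenario("ambiguous-task", "handle the file",
             lambda out: "which file" in out.lower()),
]
```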
The seventh lesson. Safety boundaries have to be designed in, not added on. We added a "guardrail" layer fairly late and then had to retrofit it into everything. In hindsight, the boundary — what actions the agent can take autonomously, what requires approval, what is forbidden entirely — should have been the first thing designed. Every agent framework we have seen handles this as an afterthought. It is much easier to add capabilities within a tight boundary than to retrofit a boundary around permissive capabilities.
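What "designed in" looks like in miniature: a default-deny policy that every tool call passes through, with the three tiers explicit. The tool names here are made up:

```python
from enum import Enum

class Permission(Enum):
    AUTONOMOUS = "autonomous"          # the agent may act on its own
    NEEDS_APPROVAL = "needs_approval"  # pause and ask a human first
    FORBIDDEN = "forbidden"            # never, regardless of instructions

POLICY = {
    "read_file": Permission.AUTONOMOUS,
    "send_email": Permission.NEEDS_APPROVAL,
    "drop_table": Permission.FORBIDDEN,
}

def authorize(action: str) -> Permission:
    # Default-deny: anything not explicitly listed requires approval.
    return POLICY.get(action, Permission.NEEDS_APPROVAL)
```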
The eighth lesson, and the one we least expected. Users develop strong opinions about agent personality. Not its wit — its defaults, its verbosity, its confidence, how much it explains versus just does. These are design decisions and deserve the same thoughtfulness as visual design. We iterated on the "voice" of our agents four or five times over the year. The version we are happiest with is quiet, direct, admits uncertainty, and never pretends to have done something it has not. Users who have been with us from the start have told us the product is noticeably better not because of any capability change but because of the tonal shifts.
If there is a through-line to all of this, it is that the hardest parts of making agents useful are the ones that are not about models. The infrastructure around the model — observability, failure handling, cost, concurrency, memory, evals, safety, voice — is where the quality of the product actually comes from. The model matters, but only as one input among many. Teams that treat agent-building as model selection plus prompt engineering get middling results and cannot figure out why.
Everything we learned is reflected in the open-source infrastructure we publish and in Alumia itself. If you are starting down this path and want to skip the detours, start there.
