Spinning up AI agents feels like hiring tireless employees. Operating a thousand of them feels like running a call centre.

At Relevance AI, customers can deploy AI agents that work around the clock, executing thousands of tasks a day across CRM updates, prospecting, research, and customer outreach. The promise is simple: set the agent live, and let it run.

But as our enterprise customers scaled their AI workforces from tens to thousands of daily tasks, a quieter problem surfaced. Every day, a percentage of those tasks would fail, get stuck, or surface for human approval. At a thousand tasks a day, even a 5% exception rate means fifty issues to triage. At Databricks scale, it meant hundreds.

I led the end-to-end design of this large, ambiguous TaskOps project, from discovery through to build, partnering closely with our Solution Engineering team and our tech leads.

Built for one agent, breaking under a thousand

The original Workforce experience at Relevance AI was designed when most customers ran a handful of agents. To review what an agent was doing, you opened that agent, scrolled its task timeline, and worked through anything flagged for attention one item at a time.

That model worked well at small scale, but as customers grew their AI workforces, it stopped working entirely.

By the time I started this project, our largest enterprise customers were running 100,000+ agent tasks a month across dozens of agents. Errors, escalations, and pending approvals were spread across individual agent task lists, with no consolidated view, no way to spot patterns, and no way to action items in bulk. Operators were stuck inspecting tasks one at a time inside a side drawer, with no visibility into the bigger picture.

The product had a tasks page and individual agent task timelines, but neither answered the questions the AI ops persona was actually asking: What broke today? How big is it? How do I fix all of it at once?

User interviews revealed key pain points

❌ Hard to identify common agent errors across a project

Errors are siloed per agent, with no way to spot patterns. An error hitting 3 agents looks like 3 separate problems, not one systemic issue. Operators can't prioritise what to fix first.

❌ Manually resolving each task issue was highly time-consuming

There's no way to take bulk action on a cluster of identical failures or similar approval requests.

❌ Three separate problems or one systemic one?

With no consolidated surface, an error hitting three agents looked like three issues. It was actually one.
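
To make the "three issues, actually one" problem concrete, here is a minimal sketch of what a consolidated view does underneath: it groups failed tasks by error across every agent. The `FailedTask` fields, the agent names, and the error strings are all hypothetical and do not reflect Relevance AI's actual data model. Grouping on the exact error message is also an assumption, since a real system would likely need to normalise messages first.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedTask:
    # Hypothetical fields for illustration only, not a real schema.
    task_id: str
    agent: str
    error: str


def group_by_error(tasks):
    """Group failed tasks by error message, regardless of which agent ran them."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.error].append(task)
    return dict(groups)


# Invented sample data: one shared failure across three agents, plus one isolated failure.
failures = [
    FailedTask("t1", "prospecting-agent", "CRM token expired"),
    FailedTask("t2", "research-agent", "CRM token expired"),
    FailedTask("t3", "outreach-agent", "CRM token expired"),
    FailedTask("t4", "research-agent", "Search timed out"),
]

clusters = group_by_error(failures)

# List the largest clusters first so the systemic issue surfaces at the top.
for error, tasks in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    agents = sorted({t.agent for t in tasks})
    print(f"{error}: {len(tasks)} task(s) across {len(agents)} agent(s)")
```

Viewed per agent, the expired token shows up as three unrelated failures. Grouped across agents, it collapses into a single cluster that can be prioritised and fixed once.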

❌ Alerts without actionability

Existing alerts told operators that something had happened, but didn't link them to a filtered view where they could diagnose and triage it. Operators had to manually hunt down what each alert was actually about.