Grab just published a case study on how their Analytics Data Warehouse team built a multi-agent AI system to handle internal engineering support at scale.
The ADW platform serves over 1,000 internal users, manages more than 15,000 tables, and sits at the core of Grab's analytics infrastructure. As the platform grew, engineers were spending most of their time on SQL debugging, log investigation, and repetitive support requests instead of doing actual platform work.
Their solution: a multi-agent system built on LangGraph and FastAPI that routes requests intelligently and handles the heavy lifting.
But the most valuable part of this case study is not the system itself. It is the three architectural decisions they made along the way. Each one goes against what most builders instinctively do.
Lesson 1: Split Your Agents by Intent, Not Just by Task

Grab designed two completely separate workflows inside their system.
The first handles Investigation: query analysis, log retrieval, schema lookup, root-cause analysis, and summarization. Read-only. Analytical. Safe to run freely.
The second handles Enhancement: generating code changes, SQL fixes, merge requests, and automation tasks. These touch real systems and carry real risk.
This is not just a clean architecture decision. It is a safety and reliability decision.
When one agent is responsible for both reading and writing, its reasoning chain gets polluted. An agent that analyzed logs five steps ago and is now generating a schema fix has to carry too much conflicting context. Errors become harder to trace. Outputs become less predictable.
The pattern here is called Separation of Concerns, and it has been a software engineering principle for decades. What Grab validated is that it applies just as strongly to AI agents as it does to traditional code.
The takeaway for builders: If your agent does more than one fundamentally different type of work, split it. Analyst agents and executor agents should not share the same reasoning chain.
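To make the split concrete, here is a minimal sketch in plain Python. It is not Grab's implementation (the case study describes a LangGraph and FastAPI system); every name here, including classify_intent, ENHANCEMENT_KEYWORDS, and the step names, is a hypothetical stand-in. The keyword-based classifier in particular is an assumption chosen to keep the example runnable; a real router would more likely use a classifier model or an LLM call.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Intent(Enum):
    INVESTIGATION = "investigation"  # read-only analysis
    ENHANCEMENT = "enhancement"      # proposes changes to real systems


# Hypothetical keyword list, used only to keep this sketch self-contained.
ENHANCEMENT_KEYWORDS = ("fix", "change", "update", "create", "migrate")


@dataclass
class WorkflowResult:
    intent: Intent
    steps: List[str] = field(default_factory=list)
    requires_approval: bool = False


def classify_intent(request: str) -> Intent:
    """Route to ENHANCEMENT if the request asks for a change, else INVESTIGATION."""
    words = [word.strip(".,?!") for word in request.lower().split()]
    if any(word in ENHANCEMENT_KEYWORDS for word in words):
        return Intent.ENHANCEMENT
    return Intent.INVESTIGATION


def run_investigation(request: str) -> WorkflowResult:
    # Read-only steps: nothing here modifies a table, repo, or job.
    return WorkflowResult(
        Intent.INVESTIGATION,
        ["analyze_query", "retrieve_logs", "summarize"],
    )


def run_enhancement(request: str) -> WorkflowResult:
    # Change-producing steps: the result is flagged for human approval.
    return WorkflowResult(
        Intent.ENHANCEMENT,
        ["generate_sql_fix", "open_merge_request"],
        requires_approval=True,
    )


# Each intent maps to its own workflow, so the two never share a reasoning chain.
WORKFLOWS: Dict[Intent, Callable[[str], WorkflowResult]] = {
    Intent.INVESTIGATION: run_investigation,
    Intent.ENHANCEMENT: run_enhancement,
}


def handle(request: str) -> WorkflowResult:
    return WORKFLOWS[classify_intent(request)](request)
```

Calling handle("Why did last night's job fail?") lands in the investigation workflow, while handle("Please fix the broken join in this query") lands in the enhancement workflow with requires_approval set. The point of the design is that the routing decision happens once, up front, and each workflow starts with a clean context.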
Lesson 2: Fewer Tools, Better Agents

This is the most counterintuitive finding in the entire case study.
Grab initially exposed over 30 internal tools to their agents: SQL tools, logging systems, metadata access, code search, and Git workflows. The logic made sense at the time: more capability means more powerful agents.
The result was the opposite.
When agents had 30+ tools available, tool selection became unpredictable. The reasoning flow grew more complex. Reliability dropped in production.
Grab's fix was to reduce the toolset to a curated, smaller set. Reliability went up.
This problem is often described as action space explosion. When a model has too many choices at each step, the probability of selecting the right action decreases, especially as task complexity increases. Large action spaces are a well-known difficulty in reinforcement learning, and the same pressure shows up in LLM-based agents.
The instinct most builders have is to add more tools to make agents smarter. Grab's experience points the other way: in their production system, a smaller, well-designed toolset outperformed a large, generic one.
The takeaway for builders: Before adding another tool to your agent, ask what the agent actually needs to complete its specific job. The answer is usually closer to 3 to 5 tools than to 30.
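One way to enforce a curated toolset is a per-agent allowlist over a larger registry. The sketch below is an illustration under assumptions, not Grab's code: the tool names, the agent names, and the MAX_TOOLS_PER_AGENT cap of 5 are all hypothetical, and the tool functions are stubs.

```python
from typing import Callable, Dict, Tuple

Tool = Callable[[str], str]


def _stub(name: str) -> Tool:
    """Build a placeholder tool; a real one would call an internal service."""
    def tool(arg: str) -> str:
        return f"{name}({arg})"
    return tool


# The full registry can be large. Agents never see it directly.
TOOL_REGISTRY: Dict[str, Tool] = {
    name: _stub(name)
    for name in (
        "search_logs", "lookup_schema", "explain_query", "code_search",
        "generate_sql_fix", "open_merge_request", "list_jobs", "read_metadata",
    )
}

# Hypothetical allowlists: each agent gets only what its job requires.
AGENT_TOOLSETS: Dict[str, Tuple[str, ...]] = {
    "investigator": ("search_logs", "lookup_schema", "explain_query"),
    "enhancer": ("generate_sql_fix", "open_merge_request"),
}

# Assumed cap, chosen for illustration only.
MAX_TOOLS_PER_AGENT = 5


def tools_for(agent: str) -> Dict[str, Tool]:
    """Return only the tools on the agent's allowlist, enforcing the cap."""
    names = AGENT_TOOLSETS[agent]
    if len(names) > MAX_TOOLS_PER_AGENT:
        raise ValueError(f"{agent} has {len(names)} tools; cap is {MAX_TOOLS_PER_AGENT}")
    return {name: TOOL_REGISTRY[name] for name in names}
```

With this shape, adding a tool to an agent is an explicit edit to AGENT_TOOLSETS rather than a side effect of registering a new tool, which makes the "do we need this?" question hard to skip.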
Lesson 3: Human-in-the-Loop Is a Feature, Not a Limitation

Grab does not allow AI to deploy code autonomously.
Every enhancement workflow that produces code changes requires human review and engineer approval before anything reaches production. On top of that, the system includes SQL validation layers, sensitive data protection, and exposure risk detection.
This is not a sign of an immature system. It is a deliberate design choice by engineers who understand what failure looks like at infrastructure scale.
Think about what a single bad autonomous deployment could do to a system managing 15,000 tables serving 1,000+ users. The blast radius is large, and the damage is not recoverable in minutes.
The EU AI Act requires human oversight for high-risk AI systems, and NIST's AI Risk Management Framework, which is voluntary guidance, also treats human oversight as a core part of managing AI risk. But Grab did not need a regulatory framework to reach this conclusion. They reached it by thinking through failure modes.
This is a critical point for anyone selling AI automation to B2B clients. The argument is not "our AI is so good it needs no oversight." The argument is "our AI is well-governed, which is why you can trust it in production."
The takeaway for builders: Human review layers do not slow down AI systems. They are what makes AI systems safe enough to scale. Build review gates early. Do not retrofit them later.
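A review gate can be as simple as a status check that the deploy step refuses to bypass. The following is a minimal sketch, assuming a hypothetical ChangeRequest type and a deliberately naive validate_sql check; the case study does not describe how Grab's validation layers are implemented, so treat every name here as illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ChangeRequest:
    description: str
    sql: str
    status: Status = Status.PENDING_REVIEW
    reviewer: Optional[str] = None


# Assumed denylist for illustration; real validation would parse the SQL.
FORBIDDEN_STATEMENTS = ("drop", "truncate")


def validate_sql(sql: str) -> List[str]:
    """Return a list of problems found by a naive token check."""
    tokens = sql.lower().split()
    return [f"forbidden statement: {kw}" for kw in FORBIDDEN_STATEMENTS if kw in tokens]


def approve(change: ChangeRequest, reviewer: str) -> None:
    """Record an engineer's approval on the change."""
    change.status = Status.APPROVED
    change.reviewer = reviewer


def deploy(change: ChangeRequest) -> str:
    """Deploy only if automated validation passes AND a human has approved."""
    problems = validate_sql(change.sql)
    if problems:
        raise ValueError("; ".join(problems))
    if change.status is not Status.APPROVED:
        raise PermissionError("change has not been approved by an engineer")
    return f"deployed: {change.description}"
```

An agent can create a ChangeRequest, but only a call to approve by a human reviewer moves it past the gate. Automated validation runs first so that reviewers are not asked to catch what a script can.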
The Bigger Pattern
What Grab built is not a chatbot with multiple personalities. It is a production system with workflow orchestration, tool governance, context management, and structured human oversight.
They also solved a hard technical problem that does not get talked about enough: context window management at scale. Keeping enough information for agents to reason well without overloading the context window required structured compression, selective retrieval, and filtering strategies.
Most demos skip this. Production systems cannot.
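As a rough illustration of selective retrieval and filtering under a context budget, consider the sketch below. It is not Grab's approach, which the case study describes only at a high level. The word-overlap relevance score and the word-count token estimate are both simplifying assumptions; a production system would use a real tokenizer and a proper retrieval model.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Assumption: whitespace word count as a crude proxy for token count.
    return len(text.split())


def relevance(snippet: str, question: str) -> int:
    # Assumption: shared-word count as a crude proxy for relevance.
    return len(set(question.lower().split()) & set(snippet.lower().split()))


def build_context(question: str, snippets: List[str], budget: int) -> List[str]:
    """Pick the most relevant snippets that fit within the token budget."""
    ranked = sorted(snippets, key=lambda s: relevance(s, question), reverse=True)
    selected: List[str] = []
    used = 0
    for snippet in ranked:
        if relevance(snippet, question) == 0:
            continue  # filtering: drop snippets with no overlap at all
        cost = estimate_tokens(snippet)
        if used + cost > budget:
            continue  # selective retrieval: skip what does not fit
        selected.append(snippet)
        used += cost
    return selected
```

The key property is that the budget is enforced before anything reaches the model, so adding more log sources or tables upstream cannot silently overflow the context window.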
The broader trend here is also worth noting. Large-scale companies are not betting on one superintelligent AI that handles everything. They are building networks of specialized agents with clear responsibilities, controlled tool access, and human review at critical decision points. Stability and predictability are winning over raw capability.
That is not a limitation of current AI. That is engineering maturity.
If you are building agentic systems right now, it is worth pressure-testing whatever you are shipping against these three principles:
Are your investigation and execution concerns separated?
Are you adding tools because you need them or because you can?
Where are the human review gates in your deployment flow?
The answers will tell you a lot about how your system will hold up in production.
