AI Agents Simulated a Supply Chain and Lost to Humans

Claude Opus 4.8, today's flagship model, orchestrated the run, breaking the game into one isolated decision per station and handing each to a sub-agent. The agents started from the same position as the human teams, under the same rules. But with no feedback loops between the roles, the inefficiencies did not cancel out. They amplified, and the AI chain ended up costing about 2.6 times as much as the human teams that played the same game.

The Beer Game is a supply-chain simulation built at MIT in the 1960s. Four stations sit in a line: a retailer, a wholesaler, a distributor, and a factory. Beer flows downstream toward the customer, and orders flow upstream toward the factory. Each station sees only its own desk, meaning its own inventory, the order that just arrived from below, and the shipment that just landed from above. Nobody talks to anyone else. Every week each station decides one number: how many cases to order from its supplier. Orders and shipments both take time to arrive, so a decision made today does not land for weeks.

The game has one famous lesson. A small, one-time change in customer demand, passed up a chain of people who cannot see past their own station, turns into wild swings in orders and inventory. The retailer nudges its order up. The wholesaler, seeing that nudge on top of its own thinning shelf, orders a bit more. The distributor more again. By the time the signal reaches the factory, a change of a few cases has become an order for dozens. This is the bullwhip effect, and it is expensive: every unit held in inventory costs money each week, and every unit a customer wants but cannot get costs more.
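The amplification can be reproduced with a few lines of arithmetic. The sketch below is an illustration only, not the policy any player or agent used. It assumes each station orders what it just received plus an extra `k` times the latest change in its incoming orders, and it ignores shipping delays and inventory entirely. The names `chase_trend`, `peak_orders`, and `k` are hypothetical.

```python
def chase_trend(incoming, k=1.0):
    """Orders placed by one station that over-reacts to changes.

    ASSUMPTION (illustrative rule, not from the experiment):
    order = incoming order + k * (change since last week), floored at zero.
    """
    orders = []
    prev = incoming[0]  # treat the first week as "no change"
    for x in incoming:
        orders.append(max(0.0, x + k * (x - prev)))
        prev = x
    return orders


def peak_orders(demand, stations=4, k=1.0):
    """Pass the signal up the chain and record each station's largest order."""
    signal = demand
    peaks = []
    for _ in range(stations):
        signal = chase_trend(signal, k)  # this station's orders feed the next one
        peaks.append(max(signal))
    return peaks


# Demand is flat at 4 for four weeks, then steps once to 8.
demand = [4] * 4 + [8] * 8
print(peak_orders(demand))  # [12.0, 20.0, 36.0, 68.0]
```

Under this assumed rule with k = 1, a one-time step from 4 to 8 becomes a peak factory order of 68, followed by a week in which the factory orders nothing at all.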

We ran the game with AI. We asked Claude Opus 4.8, the flagship model at the time of writing, to design and run the experiment, and it did what frontier models now do with a large task: it broke the work into pieces and handed each piece to a separate sub-agent. The workflow it wrote spins up sixteen agents, one per station across four teams, and gives each one only its own station's view through the same interface a human player uses. Those sub-agents ran on a smaller, faster model. No agent could see another station or another team, and none could coordinate with any other. Every week, each agent read its own numbers and placed its own order.

That structure is the point. Decompose a goal, distribute the parts to a fleet of agents that each work in isolation, collect the results: this is how state-of-the-art AI increasingly operates. So this is not a toy about a board game. It is a clean look at what happens when that pattern meets a problem where the parts depend on one another. We compared the outcome to human teams who played the identical game in person, and to the cost of a perfectly coordinated chain.

Over the same 32 weeks, the AI teams cost 2.6x the human teams playing the identical game, and 3.2x a perfectly coordinated chain.

Why this matters

Companies are starting to hand routine operational decisions to fleets of AI agents. The Beer Game is a clean test of what happens when capable agents act on local information with no shared view. The pattern here, more reaction and more cost the further you move up the chain, is the same pattern that appears whenever an organization loses the loops that carry information between the people making decisions.

How we ran it

Players

Four AI teams (16 agents, one per station) executing on a smaller fast model, against two human teams who played the identical ruleset in person.

Orchestration

Claude Opus 4.8, the flagship model at the time of writing, designed and ran the experiment. It authored a workflow that decomposes the game into one decision per station per week and fans each one out to an isolated sub-agent, the same decompose-and-distribute pattern behind current agentic AI.

The one shock

Customer demand was flat at 4 units for weeks 1 to 4, then stepped once to 8 from week 5 on. That single change is the only disturbance in the game.

Cost

Holding inventory costs 1 per unit per week. A backorder you cannot fill costs 2 per unit per week. Whole-chain cost sums all four stations.

Isolation

Each agent saw only its own inventory, incoming order, and incoming shipment, through the same API a human uses. No agent could see another station, another team, or the real game length.

The ideal benchmark

The engine's steady-ordering chain, where each station simply relays demand upstream. A disciplined pass-through baseline, not a strict optimum.

Length

The AI game ran 36 weeks; the human game was stopped at 32. Every cross-cohort comparison below uses the matched 32-week figures or states the length.
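The setup above can be sketched as a toy simulation. This is not the engine used in the experiment, and its costs will not match the figures reported here. It assumes a starting stock of 12 units per station (not stated in the article), pipelines pre-filled at the initial demand of 4, two-week delays for both orders and shipments, and a factory with unlimited capacity. `Station`, `simulate`, `relay_demand`, and `chase_backlog` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field

HOLDING_COST, BACKORDER_COST = 1, 2  # per unit per week, from the article
DELAY = 2         # weeks, for orders and for shipments
START_STOCK = 12  # ASSUMPTION: starting inventory is not given in the article
BASE_FLOW = 4     # pipelines start filled at the initial demand of 4


@dataclass
class Station:
    inventory: int = START_STOCK
    backlog: int = 0
    peak_order: int = 0
    # Shipments on their way to this station, oldest first.
    inbound: deque = field(default_factory=lambda: deque([BASE_FLOW] * DELAY))
    # Orders this station has placed that have not yet reached its supplier.
    placed: deque = field(default_factory=lambda: deque([BASE_FLOW] * DELAY))


def relay_demand(incoming_order, station):
    """Steady ordering: pass the incoming order upstream unchanged."""
    return incoming_order


def chase_backlog(incoming_order, station):
    """A naive local rule (hypothetical): also reorder everything still owed."""
    return incoming_order + station.backlog


def simulate(demand, policy, n_stations=4):
    chain = [Station() for _ in range(n_stations)]  # index 0 is the retailer
    total_cost = 0
    for customer_demand in demand:
        # arriving[i] is the order reaching station i this week.
        # The last entry is the factory's own order, which becomes production.
        arriving = [customer_demand] + [s.placed.popleft() for s in chain]
        for i, s in enumerate(chain):
            s.inventory += s.inbound.popleft()
            owed = s.backlog + arriving[i]
            shipped = min(s.inventory, owed)
            s.inventory -= shipped
            s.backlog = owed - shipped
            if i > 0:
                chain[i - 1].inbound.append(shipped)  # goods move downstream
            # Each policy sees only this station's own numbers.
            order = max(0, policy(arriving[i], s))
            s.placed.append(order)
            s.peak_order = max(s.peak_order, order)
            total_cost += HOLDING_COST * s.inventory + BACKORDER_COST * s.backlog
        chain[-1].inbound.append(arriving[-1])  # factory production
    return total_cost, [s.peak_order for s in chain]


demand = [4] * 4 + [8] * 28  # 32 weeks: flat at 4, then one step to 8
cost, peaks = simulate(demand, relay_demand)
print(peaks)  # [8, 8, 8, 8]
```

Under the pass-through rule no station ever orders more than the true demand of 8. Swapping in `chase_backlog` gives one way to compare a reactive local rule against that baseline in the same toy setting.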

Cost ran away from the ideal

Through the first ten weeks every team, human and AI, tracked the ideal line closely. The chain starts in balance, and the demand step had not yet worked its way upstream. Then the AI teams separated. From around week 12 their cumulative cost climbed steeply and kept climbing. The human teams bent only slightly above the ideal and stayed there.

Figure 1. Cumulative whole-chain cost. The four AI teams (orange) pull away from the ideal after week 12; the two human teams (blue) stay close to the steady-ordering benchmark.

The four AI teams finished between 4,787 and 7,160 in whole-chain cost over 36 weeks, a mean of 5,852. The human teams finished at 1,564 and 2,064 over 32 weeks. The AI cost did not spread evenly across the chain. It piled up in the wholesaler and distributor seats in the middle, where each agent reacted to a swing that was already amplified below it, then amplified it again. The retailers, closest to the steady real demand, stayed cheap.

Whole-chain cost by station, the four AI teams (36 weeks)
AI team	Retailer	Wholesaler	Distributor	Factory	Total
Teal	656	1,242	1,684	1,205	4,787
Amber	812	1,863	1,844	856	5,375
Indigo	996	1,572	2,384	1,134	6,086
Crimson	854	2,144	2,658	1,504	7,160
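The totals and the concentration of cost in the middle seats can be checked directly from the table. The short script below only re-adds the published figures; the variable names are ours.

```python
# Per-station costs copied from the table above (AI teams, 36 weeks).
costs = {
    "Teal":    {"Retailer": 656, "Wholesaler": 1242, "Distributor": 1684, "Factory": 1205},
    "Amber":   {"Retailer": 812, "Wholesaler": 1863, "Distributor": 1844, "Factory": 856},
    "Indigo":  {"Retailer": 996, "Wholesaler": 1572, "Distributor": 2384, "Factory": 1134},
    "Crimson": {"Retailer": 854, "Wholesaler": 2144, "Distributor": 2658, "Factory": 1504},
}

# Whole-chain total per team, and the mean across the four teams.
totals = {team: sum(row.values()) for team, row in costs.items()}
mean_total = sum(totals.values()) / len(totals)

# Cost summed per station across teams, and the share carried by the two middle seats.
stations = ["Retailer", "Wholesaler", "Distributor", "Factory"]
station_sum = {s: sum(row[s] for row in costs.values()) for s in stations}
middle_share = (station_sum["Wholesaler"] + station_sum["Distributor"]) / sum(totals.values())

print(totals)                  # {'Teal': 4787, 'Amber': 5375, 'Indigo': 6086, 'Crimson': 7160}
print(mean_total)              # 5852.0
print(round(middle_share, 2))  # 0.66
```

The wholesaler and distributor seats together account for roughly two thirds of the AI teams' total cost.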

Orders amplified up the chain

The bullwhip shows clearest in the orders themselves. Customer demand never rose above 8 units. Yet the further an AI agent sat from the customer, the larger its biggest order grew. The human orders barely grew up the chain at all.

Figure 2. Mean peak weekly order by role. For the AI teams the peak order climbs steeply from retailer to factory; for the humans it stays close to true demand.

Averaged across teams, the AI retailer's largest order was 16 units, the wholesaler's 32, the distributor's 54, and the factory's 59. The worst AI factory ordered 80 units in a single week, ten times true demand. On the same measure the human stations sat at 9, 13.5, 16, and 18, and the worst human factory peaked at 20.

Mean peak weekly order by role. Customer demand never exceeded 8 units.
Station	AI agents	Humans
Retailer	16	9
Wholesaler	32	13.5
Distributor	54	16
Factory	59	18
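One way to read the table is as an amplification ratio: each role's mean peak order divided by the highest customer demand of 8. The calculation below uses only the figures in the table; `PEAK_DEMAND` and `ratios` are names we introduce for illustration.

```python
PEAK_DEMAND = 8  # customer demand never exceeded 8 units

# Mean peak weekly order by role: (AI agents, humans), copied from the table above.
peaks = {
    "Retailer": (16, 9),
    "Wholesaler": (32, 13.5),
    "Distributor": (54, 16),
    "Factory": (59, 18),
}

# Peak order as a multiple of the highest real demand.
ratios = {role: (ai / PEAK_DEMAND, human / PEAK_DEMAND) for role, (ai, human) in peaks.items()}

print(ratios["Retailer"])  # (2.0, 1.125)
print(ratios["Factory"])   # (7.375, 2.25)
```

By this measure the average AI factory peaked at more than seven times true demand, while the average human factory peaked at a little over twice.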

Watching one chain week by week shows the mechanism in motion. In Crimson, the most expensive AI team, the retailer's orders stayed near real demand the whole game, while the distributor and factory swung between near zero and 80 and back, chasing a backlog that their own earlier over-ordering had created.

Figure 3. Weekly orders by role for Crimson, the worst AI team. The retailer (closest to the customer) stays calm; the distributor and factory whipsaw.

The swings left mountains of stock. The AI distributor seat peaked at 150 units of inventory on one team while running deep backlogs earlier in the game, the classic overshoot-then-overcorrect cycle. No human station ever held more than 39 units.

Same game, a fraction of the cost

Because the human game ran 32 weeks and the AI game 36, we read each AI team's cost at week 32 and set it beside the humans' final cost. The gap holds.

Figure 4. Mean whole-chain cost at week 32. The AI teams sit at 3.2 times the coordinated ideal; the humans, playing the same game, at 1.23 times.

At week 32 the AI teams averaged 4,709 in whole-chain cost. The humans averaged 1,814. A perfectly coordinated chain would have cost 1,472. The humans landed 23 percent above the ideal. The AI teams landed 220 percent above it, playing the same game under the same rule that nobody may talk.
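The headline multiples follow from these three numbers. The check below recomputes them from the figures reported in this article; nothing else is assumed.

```python
ai_mean = 4709               # mean AI whole-chain cost at week 32
human_costs = [1564, 2064]   # the two human teams' final costs (32 weeks)
ideal = 1472                 # steady-ordering benchmark over 32 weeks

human_mean = sum(human_costs) / len(human_costs)

print(human_mean)                                  # 1814.0
print(round(ai_mean / human_mean, 1))              # 2.6
print(round(ai_mean / ideal, 1))                   # 3.2
print(round(human_mean / ideal, 2))                # 1.23
print(round(100 * (ai_mean - ideal) / ideal))      # 220
print(round(100 * (human_mean - ideal) / ideal))   # 23
```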

What this means

The agents were not short on capability. Each one could read its station and do the arithmetic. What they lacked was a shared view and any feedback loop to the rest of the chain. Starting context, rules, and demand were identical to the humans'. The difference was the missing connection between roles. Sitting alone with a local signal, every agent treated a temporary backlog as a reason to order more, and that extra order became the next station's shock.

Why did the humans, under the same no-talking rule, stay calmer? Here we can only offer a hypothesis. Even without speaking, people in a shared room read a slower pace, a facilitator's rhythm, and each other. The agents had none of that. Reacting quickly to the number in front of them, in complete isolation, they did not damp the signal. They amplified it. This is an interpretation of why the cohorts differed, not a measured cause.

The AI had more context, not less

Were the agents underinformed? Not at all. To build the experiment, the orchestrating model read the simulation's entire codebase. It understood every mechanic in detail: the two-week shipping and ordering delays, the exact cost of holding versus backordering, the single demand step at week five. Human players were given the rules of a board game. The model was given the source code.

So context was not the human advantage. What several of the humans had instead was experience. For some of them, this was not their first time at the table. Experience is not the same as context. Context is what you are told before you start. Experience is the judgment you reach for in the moment, especially when things go wrong. A player who has already watched a backlog spiral once does not need to be told to hold their order. They feel it coming, and they wait.

Context engineering is useful, but it has its limits. The agents knew the rules cold. But they had never been burned by them.

A local maximum that looks like progress

Step back and this is a local-versus-global maximum problem. Each agent climbed its own hill: clear my backlog, refill my shelf, cover the order in front of me. Every one of those moves was locally sensible, and locally it looked like progress. More orders placed, more units in motion, more inventory built. At nearly every station the AI chain was busier than the human chain. It also ended up costing roughly three times what the whole chain should have.

That is the uncomfortable part. The failure did not look like failure. It looked like output. A station buried in inventory and a station ordering eighty units against a demand of eight are both, locally, doing something. The motion is real. The units are real. The activity reads as performance. The only thing missing is the one thing that would reveal the problem, which is a view of the whole.

It is worth asking what happens as this pattern spreads. As teams lean harder on AI to produce and decide, the first sign is not less output. It is more: more drafts, more tickets, more analyses, more decisions, each one defensible on its own. The quiet cost is the shared context that used to keep everyone aimed at the same global target. This experiment cannot prove that organizations are drifting this way. It can only show, in a controlled setting, how easily a fleet of capable, isolated optimizers ends up far from the global best while every local reading says things are fine. Call it being blinded by the performance of it all: so much visible motion that nobody stops to check the map.

There is a second cost worth naming. Playing a game that people run with pen, paper, and poker chips took the agents 592 separate decisions and about 13.6 million tokens, across 1,769 tool calls and 37 rounds. Isolation is expensive twice over: once in the inventory bill, and again in the compute spent generating it.
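Those figures are consistent with one decision per agent per round, and they imply a rough per-decision price. The arithmetic below assumes exactly that, and treats the token count as approximate because the article gives it only as "about 13.6 million".

```python
agents = 16           # four teams of four stations
rounds = 37
tokens = 13_600_000   # approximate, as reported
tool_calls = 1_769

# ASSUMPTION: every agent makes exactly one decision in every round.
decisions = agents * rounds

print(decisions)                         # 592
print(round(tokens / decisions))         # 22973
print(round(tool_calls / decisions, 2))  # 2.99
```

That works out to roughly 23,000 tokens and about three tool calls for each single order quantity.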

The takeaway

The lesson is not that AI agents cannot run a supply chain. It is that a fleet of capable agents, each optimizing its own desk without a shared signal, reproduces the oldest failure in operations. Deployed this way, agents do not remove the bullwhip. They automate it, and they make it look like productivity.

Limitations

One run. A single session of four AI teams against two human teams, not a controlled study. The numbers describe what happened here, not a population.
Different lengths. The AI game ran 36 weeks and the human game 32. Every cross-cohort comparison uses the matched 32-week figures or states the length.
Naive agent policy. The agents ran on a simple prompt and a small fast model, with no memory of strategy between weeks and no tuning. A different prompt or a reasoning model could behave very differently.
Experienced human players. Some of the human players had played the Beer Game before, so this is seasoned humans against first-attempt agents. We treat that experience as part of the result, not a flaw to control away.
Small, non-random cohort. Four AI teams and two human teams. The human players were not sampled at random.
A simulation, not an operation. Real chains carry prices, variable lead times, and forecasts the game leaves out.

References

Sterman, J. D. (1989). Modeling Managerial Behavior: Misperceptions of Feedback in a Dynamic Decision Making Experiment. Management Science, 35(3), 321-339. The classic experimental study of the Beer Game and of order amplification along the chain.
Happily Labs (2026). AI Agents vs Humans in the Beer Game. Internal experiment on beergame.happily.ai: 4 AI teams (36 weeks) and 2 human teams (32 weeks). Figures computed from the game ledger.