When the Token Bill Comes Due: What Uber and Microsoft Just Taught the Rest of Us About Renting Intelligence

In late May, Goldman Sachs put a number on something a lot of operators have been feeling in their gut for months. Agentic AI, the bank projects, could push token demand up by more than 24 times in the next few years. Read that again. Not 24 percent. Twenty-four times.

If your AI runs on someone else’s meter, that forecast is not a growth story. It’s an invoice you haven’t opened yet.

And the companies you’d expect to absorb that hit better than anyone are the ones flinching first. Uber reportedly burned through its entire 2026 AI budget in a matter of months. Uber’s CTO went public with it, and the operations chief, Andrew Macdonald, told Business Insider something even more damning than the overspend: after talking to his senior engineers, he couldn’t find a clear line between how many tokens the company was burning and how many features customers actually got. More than 80 percent of Uber’s engineers were using agentic tools. Over 60 percent of the code was AI-generated. And the company still couldn’t show that the output was worth what it was paying for it.

Microsoft, meanwhile, started pulling its own developers off a third-party coding assistant and moving them onto an in-house tool, with a deadline that landed conveniently at the close of its fiscal year. The official line was consolidation. The timing told a different story. Microsoft also flipped one of its developer products to token-based billing because the cost of running it had ballooned.

When the two companies that helped write the playbook for aggressive AI adoption are both quietly restructuring how they buy it, that’s not a blip. That’s the meter catching up with the marketing.

The math nobody put on the slide

For two years, the pitch for cloud AI has been simple: usage is cheap, it’ll only get cheaper, and you can scale infinitely. The first part was true for a chatbot answering one question at a time. It stops being true the moment you point an agent at a real workflow.

A single agentic task can consume more than a thousand times the tokens of a one-shot chatbot query. Agents don’t ask once. They plan, call tools, check their own work, retry, and chain steps together, and every one of those steps is metered. Multiply that by a whole department running agents all day, then layer on the Goldman Sachs 24x demand curve, and the “it’ll get cheaper” story collapses under its own arithmetic.
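To make that arithmetic concrete, here is a minimal Python sketch. Only the thousand-fold agent multiplier and the 24x demand figure come from the text above. The per-query token count, the price per million tokens, the head count, and the workload are hypothetical numbers picked for illustration, and applying an industry-wide 24x projection to a single team is itself an assumption.

```python
def monthly_token_cost(tokens_per_task: int, tasks_per_day: int, users: int,
                       price_per_million: float, workdays: int = 22) -> float:
    """Metered cost in dollars for one month of usage."""
    total_tokens = tokens_per_task * tasks_per_day * users * workdays
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical inputs, not figures from any vendor or from the article.
CHAT_TOKENS_PER_QUERY = 1_000   # assumed size of a one-shot chatbot exchange
TASKS_PER_DAY = 20              # assumed tasks per person per day
USERS = 50                      # assumed department size
PRICE_PER_MILLION = 5.0         # assumed dollars per million tokens

# Figures taken from the text.
AGENT_MULTIPLIER = 1_000        # "more than a thousand times the tokens"
DEMAND_GROWTH = 24              # projected growth in token demand

chat_cost = monthly_token_cost(
    CHAT_TOKENS_PER_QUERY, TASKS_PER_DAY, USERS, PRICE_PER_MILLION)
agent_cost = monthly_token_cost(
    CHAT_TOKENS_PER_QUERY * AGENT_MULTIPLIER, TASKS_PER_DAY, USERS, PRICE_PER_MILLION)
# Assumes this team's usage grows in line with the projected demand curve.
grown_cost = agent_cost * DEMAND_GROWTH

print(f"Chatbot usage:     ${chat_cost:,.0f} per month")
print(f"Agentic usage:     ${agent_cost:,.0f} per month")
print(f"Agentic usage x24: ${grown_cost:,.0f} per month")
```

With these made-up inputs, the same team goes from roughly $110 a month of chatbot usage to roughly $110,000 a month of agent usage before any growth is applied. The absolute figures matter less than the shape: the per-token price stays the same while the bill scales with every multiplier stacked on top of it.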

The numbers coming out of the industry have started to sound less like efficiency and more like a dare. Nvidia’s CEO said earlier this year that if one of his $500,000 engineers wasn’t burning at least $250,000 in tokens, he’d be worried. Airbnb’s CEO bragged that 60 percent of the company’s code is now AI-generated. Another company reported that 84 percent of its code was AI-written. A three-person team running an aggressive stack of agents managed to spend $1.3 million in tokens in a single month.

Somewhere in there, the conversation quietly stopped being about results and started being about consumption for its own sake. Token usage became the brag, as if the size of the bill proved the value of the work. Uber just demonstrated, in public, that it doesn’t.

Consumption is not a strategy

Here’s the part that should land for any business owner watching this from the outside: the meter punishes exactly the usage you’re being told to chase.

You are encouraged to put AI into everything, hand the agents more autonomy, let them run longer and reason harder. Every one of those instructions increases token consumption. So the more seriously you take the advice, the faster your costs compound, and they compound on a curve you don’t control and can’t predict. You find out what you owe after the work is done. That’s a brutal way to run a budget, and it’s an impossible way to run a small or mid-sized organization that needs to know its number before the quarter starts, not after.

This is the question we ask clients to sit with before they sign anything: what happens to our costs if our usage succeeds? If the honest answer is “they go up in a way we can’t forecast,” then the platform isn’t priced for you to win. It’s priced for you to ration.

And rationing is precisely what’s happening. The biggest names in tech are now teaching their people to use less of the very tools they spent a year telling everyone to use more of. If Microsoft and Uber can’t make consumption-based AI pencil out at their scale, the odds are not good that a 40-person law firm or a boutique advisory shop will.

The hardware cavalry isn’t coming in time

The usual reassurance is that better chips will rescue the economics. Next-generation inference hardware is genuinely more efficient, and the Goldman Sachs report leans on exactly that hope: cheaper tokens, usage keeps climbing, profits eventually follow.

The timing doesn’t cooperate. The newest platforms are still rolling out, and the efficiency gains, real as they are, are years from deploying at the scale this demand curve requires. In the meantime, more than half of the data center projects planned around the current generation of hardware have reportedly been delayed or cancelled, choked by shortages of power and parts. The hyperscalers themselves have started stretching their hardware to run for six years instead of replacing it on the old cadence, which is hard to square with the promise of a dramatic efficiency leap every single year.

So the demand is exploding now. The relief is theoretical and late. And the gap between the two gets paid for in your monthly bill.

There is another way to buy this

None of this is an argument against AI. We build our business on AI. It’s an argument against renting your intelligence by the drink from infrastructure you don’t control, priced on a model designed to climb.

At Modular Technology Group, we made a different bet, and the news this month is the reason we made it. Modular runs private AI on infrastructure we own, in a US-based FedRAMP data center, at a fixed monthly price. No per-token billing. No per-query billing. No consumption meter quietly compounding in the background while your agents do exactly what you asked them to do.

When the AI runs on hardware you control, the equation flips. Heavier usage doesn’t mean a heavier invoice. Once the box is yours, more agents, longer reasoning, and bigger context all live inside a cost you already know. The incentive inverts: instead of being penalized for using AI more, you’re free to. That’s the difference between intelligence as a metered utility and intelligence as owned capability.
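The flip described above can be sketched as a break-even comparison. The following Python example is illustrative only: the flat fee, the per-token price, and the baseline token volume are hypothetical values, not Modular’s pricing or any vendor’s, and the sketch ignores the capacity ceiling that any fixed set of hardware has.

```python
def metered_cost(tokens: float, price_per_million: float) -> float:
    """Dollars owed under per-token billing for a given monthly volume."""
    return tokens / 1_000_000 * price_per_million


def break_even_tokens(flat_fee: float, price_per_million: float) -> float:
    """Monthly token volume at which metered billing equals the flat fee."""
    return flat_fee / price_per_million * 1_000_000


# Hypothetical inputs chosen for illustration only.
FLAT_FEE = 10_000.0          # assumed fixed monthly price, in dollars
PRICE_PER_MILLION = 5.0      # assumed metered price per million tokens
BASE_TOKENS = 500_000_000    # assumed starting monthly token volume

break_even = break_even_tokens(FLAT_FEE, PRICE_PER_MILLION)

# Compare the two pricing models as usage doubles repeatedly.
rows = []
for growth in (1, 2, 4, 8):
    tokens = BASE_TOKENS * growth
    rows.append((growth, metered_cost(tokens, PRICE_PER_MILLION), FLAT_FEE))

print(f"Break-even volume: {break_even:,.0f} tokens per month")
for growth, metered, flat in rows:
    print(f"{growth}x usage: metered ${metered:,.0f} vs flat ${flat:,.0f}")
```

Below the break-even volume the meter is cheaper. Above it, every additional token widens the gap in favor of the fixed price, which is why the useful planning question is where your usage will sit if the tools actually work.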

A few things follow from owning the stack instead of renting it:

Your costs are knowable before the work starts. A flat monthly fee means the budget conversation happens once, up front, not in a panicked review when the usage report comes in. No surprises, no variable cloud bill, no quarter blown in a month.

Your usage can succeed without punishing you. The whole point of AI is to do more with it over time. On a metered model, success is the thing that breaks your budget. On owned infrastructure, success is just success.

You’re not locked to one vendor’s pricing whims. Microsoft just moved a product to token billing because its own costs ran away. When you don’t own the layer your business depends on, someone else’s cost problem becomes your pricing problem overnight. We run the model that fits the job, on hardware that’s ours, so a vendor’s repricing isn’t your emergency.

Your data stays yours. This was always the foundation. Models run locally, on our infrastructure, in our facility. Your data never routes through someone else’s cloud to get answered. Your data, your rules. The cost predictability is a benefit that rides on top of the same architecture that keeps your information private in the first place.

The meter was always the business model

The token-billing crisis isn’t a bug in cloud AI. It’s the business model working as designed. Usage was always going to climb, agents were always going to multiply the consumption, and the bill was always going to follow the curve. May just happened to be the month some very large companies looked up and noticed.

The organizations that come out of this ahead won’t be the ones who used AI the least to survive the bill. They’ll be the ones who stopped renting intelligence by the token and started owning it, so that using more was never the thing that hurt them.

If you’re staring at an AI bill that grows every time the tools actually work, that’s worth a conversation. We’re always happy to compare notes on what fixed-cost, private AI looks like for an organization your size. You can reach us at modtechgroup.com/consultation.

Because when the token bill finally comes due across the industry, you want to be the company that already knows its number.

Modular Technology Group builds and operates private AI infrastructure on owned, US-based hardware: fixed pricing, local inference, your data and your AI under your rules, from dirt to desktop. modtechgroup.com

