Tokenmaxxing in the Anthropic Era: A New Opening for AI Startups
If AI can replace human work, corporate profit margins should improve.
This has been the big expectation around AI adoption over the past few years. Especially in software development, customer support, and back-office work, AI agents and AI coding tools were expected to reduce labor cost while raising productivity.
But recent news suggests this view may have been a bit too simple.
Labor cost may go down. On the other hand, AI inference cost is starting to rise.
A symbolic example is Anthropic’s Claude Code.
According to Business Insider, Anthropic has raised its estimate of the average token cost per enterprise developer for Claude Code from $6 per day to $13 per day. The range that covers 90% of users has also been raised, from under $12 per day to under $30 per day. On a monthly basis, the estimate is about $150–250 per developer (Business Insider: Anthropic Doubles Estimate for Claude Code Token Spend).
Looking at the numbers alone, this may not seem like a big problem yet.
But when thousands of engineers across a company use it, and AI agents start to autonomously generate code, run tests, fix errors, and retry, the story changes.
AI cost does not stop at “X dollars per user per month” like traditional SaaS. The more it is used, the more cost piles up: tokens, API calls, context, retries, tool executions.
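To make the scaling concrete, here is a rough back-of-the-envelope sketch. It takes the $150–250 per developer per month range cited above and multiplies it by a headcount. The 5,000-engineer figure is a made-up assumption for illustration, not a number from any of the reports.

```python
# Monthly per-developer range cited above (USD).
MONTHLY_LOW, MONTHLY_HIGH = 150, 250


def company_spend(engineers: int) -> tuple[int, int]:
    """Return the (low, high) monthly spend in USD for a given headcount."""
    return engineers * MONTHLY_LOW, engineers * MONTHLY_HIGH


# Hypothetical headcount, chosen only to illustrate the arithmetic.
low, high = company_spend(5_000)
print(f"Monthly: ${low:,} to ${high:,}")
print(f"Annual:  ${low * 12:,} to ${high * 12:,}")
```

Under these assumptions the bill lands at roughly $9M to $15M a year, and that is before autonomous agent loops add usage that does not scale with headcount.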
The word that captures this is Tokenmaxxing.
I do not see this only as a cost-increase story.
I think the more widespread Tokenmaxxing becomes, the more room opens up for AI startups.
The more seriously companies adopt AI, the more they need something beyond “a better model”: infrastructure to measure which model was used for which task, at what cost, and tied to what outcome.
In other words, the spread of Tokenmaxxing could become an important market opportunity for AI infrastructure startups working on model routing, AI agent monitoring, and cost-per-task measurement.
What is Tokenmaxxing
Tokenmaxxing is the situation where using more AI — especially more tokens — is itself praised as a sign of getting the most out of AI.
A token is the unit an LLM uses to process text, code, conversation history, tool results, and so on. Inputs, outputs, intermediate agent steps, long context, and retry logs all become cost.
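As a rough illustration of how these pieces add up, the sketch below totals the cost of one agent run. The per-million-token prices and the list of steps are hypothetical placeholders, not any provider's actual price sheet. The point is only that every step is billed, including a retry that re-sends a longer context.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (not a real price list).
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class Step:
    name: str
    input_tokens: int   # prompt, history, and tool results sent to the model
    output_tokens: int  # tokens the model generated


def step_cost(step: Step) -> float:
    """Cost in USD of a single model call."""
    return (step.input_tokens * PRICE_PER_M_INPUT
            + step.output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


# One made-up agent run: the retry re-sends a larger context.
run = [
    Step("plan", 2_000, 500),
    Step("edit code", 20_000, 2_000),
    Step("retry after failed test", 30_000, 2_000),
]

total = sum(step_cost(s) for s in run)
print(f"Run cost: ${total:.4f}")
```

With these placeholder numbers the retry alone accounts for more than half of the run's cost, which is why retries and long context matter more than the headline price per token.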
What companies should really look at is not token consumption itself.
What they should look at is productivity in terms like the following.
| What to look at | Examples |
|---|---|
| Which task it was used for | Coding, summarization, classification, research, support, etc. |
| How much it cost | Tokens, API, inference, tool execution cost |
| What outcome it produced | Adopted code, resolved tickets, time saved |
| Whether it is reasonable vs. human work | Cost, quality, speed, risk comparison |
In companies in early-stage AI adoption, this often gets reversed.
“Employees using AI are more productive.” “Employees using more tokens are more advanced in AI.” “The more agents you run, the more advanced you are.”
When this atmosphere takes hold, token consumption stops being an outcome metric and becomes an internal game score.
This is Goodhart’s law in action: when a measure becomes a target, it ceases to be a good measure.
In the past, using lines of code or commit count as a productivity metric caused developers to optimize for line count and commit count rather than real quality.
The 2026 version of this metric may be “token consumption.”
Company cases in recent news
Tokenmaxxing is not an abstract concept. At several companies, rising AI usage and AI cost are already becoming an issue.
| Company | Reported development | What to look at |
|---|---|---|
| Anthropic / Claude Code | Raised the assumed per-developer cost for Claude Code | The more AI coding spreads, the more important inference-cost management becomes for customers |
| Uber | Reported to have burned through its 2026 AI budget in a few months due to rising use of Claude Code, Cursor, and others | When AI tools are too convenient, usage outruns the budget model |
| Amazon | Reported that internal AI usage scores and leaderboards encouraged unnecessary AI use | Making AI usage too visible turns metrics into a game for employees |
| Meta | A token consumption ranking called “Claudeonomics” was created, and over 60 trillion tokens were reportedly used in 30 days | Token use starts to look like a measure of “AI engagement” |
| Spotify | After layoffs, computing cost per employee reportedly rose because remaining staff use AI tools more | Headcount cuts and AI cost increases happen at the same time |
| Shopify / Roblox | Rising use of AI assistants and developer AI features is reported to be pushing inference cost up | The more users use these features, the more LLM cost the provider bears |
On Uber, AI Magazine reports that the company burned through its 2026 AI budget in a few months, with rising use of Claude Code and Cursor cited as the background (AI Magazine: Why Uber has Already Burned Through its AI Budget).
On Amazon, it is reported that some employees used an internal AI tool called “MeshClaw” and assigned unnecessary tasks to AI agents to inflate their usage scores. The article also describes a goal that more than 80% of developers should use AI on a weekly basis, along with a token-consumption leaderboard, which unintentionally encouraged competitive behavior (Financial Times: Amazon staff use AI tool for unnecessary tasks to inflate usage scores).
At Meta, a ranking called “Claudeonomics”, reportedly created by an employee, made token usage by about 85,000 employees visible, and over 60 trillion tokens were used in 30 days (Fortune: A Meta employee created a dashboard so coworkers can compete to be the company’s No. 1 AI token user).
For Spotify, Shopify, Roblox, and others, The Information reports that while AI is reducing labor cost, AI tool and LLM inference costs are starting to put pressure on profit margins (The Information: Tech’s AI Margin Math Is Getting Messier).
The point here is not to read these as simple “AI cost failures.”
It is more useful to read them as the next infrastructure layer becoming visible, now that companies are actually using AI in production.
The core: replacing labor with compute resources
AI adoption is not just cost reduction.
What AI reduces is human work hours. What rises in return is the cost of compute resources: GPUs, APIs, tokens, storage, logs, monitoring, security, and evaluation.
In other words, the cost structure of companies is shifting like this.
| Before | After AI adoption |
|---|---|
| Labor cost | AI inference cost |
| SaaS monthly license | Token usage billing |
| Human work hours | Agent execution time |
| Manager progress checks | Logs, traces, evaluation, monitoring |
| Outsourcing fees | Model usage fees, API costs, cloud costs |
This is not just efficiency improvement. The cost structure of companies is moving from labor-intensive to compute-intensive.
In that sense, the metric companies should watch is not “how much AI was used.”
What they should watch is cost per task.
This is where the room for AI startups opens up.
It is not easy for a company to analyze every instance of AI usage in detail on its own, measure cost-effectiveness per model, monitor agent behavior, and route work to the optimal model.
I think this “operational management of AI usage” is the next startup area.
The next metric is cost per task
Cost per task is the idea of looking at how much AI cost it takes to complete a single task.
For example, you do not need an expensive Claude Opus-class model to reply to an email. For short summaries, classification, tagging, and template generation, a small or low-cost model is often enough.
On the other hand, for complex code edits, legal documents, M&A models, medical records, and financial risk analysis, you need higher-end models and monitoring.
Even within what gets lumped together as “AI usage,” the optimal model differs by task.
| Task | Suitable model / setup |
|---|---|
| Simple classification | Small model |
| Short summary | Low-cost model |
| Internal document search | RAG + mid-tier model |
| Code edits | High-end model |
| Legal / medical / finance | High-end model + monitoring |
| Routine monitoring agents | Lightweight model |
| Final decisions | High-end model + human review |
The point is not to use the highest-end model for everything. It is to use expensive models only where they are needed.
The effect of AI adoption is going to be measured by cost-effectiveness per task, not by token consumption.
A system that measures and improves cost per task becomes new AI infrastructure for companies.
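A minimal sketch of what such a measurement could look like is below. The log format, task names, and dollar figures are all hypothetical. The sketch divides total AI spend per task type by the number of tasks that actually produced an accepted outcome, so failed attempts still count toward cost.

```python
from collections import defaultdict

# Hypothetical usage log: (task_type, ai_cost_usd, outcome_accepted)
usage_log = [
    ("code_edit", 0.40, True),
    ("code_edit", 0.60, False),   # rejected PR: the cost is still incurred
    ("code_edit", 0.50, True),
    ("ticket_reply", 0.02, True),
    ("ticket_reply", 0.02, True),
]


def cost_per_task(log):
    """Return {task_type: total cost / number of accepted outcomes}."""
    spend = defaultdict(float)
    accepted = defaultdict(int)
    for task_type, cost, ok in log:
        spend[task_type] += cost
        accepted[task_type] += int(ok)
    # Task types with no accepted outcome have no meaningful cost per task.
    return {t: spend[t] / accepted[t] for t in spend if accepted[t] > 0}


for task, cost in cost_per_task(usage_log).items():
    print(f"{task}: ${cost:.2f} per completed task")
```

Setting this figure next to the cost of a person doing the same task is what turns a raw token count into a productivity metric.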
Model routing as a startup opening
In this context, model routing becomes important.
Model routing is the practice of not relying on one AI model for everything, but using multiple models depending on task difficulty, required accuracy, speed, cost, and risk.
For example, decisions like the following can be automated.
“Does this task really need Claude Code?” “Is a GPT-class model enough for this process?” “Would an open-source model like Llama or Mistral be sufficient?” “Couldn’t expensive models be used only for final verification?”
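To make the idea concrete, here is a deliberately simple rule-based sketch that follows the task table in the previous section. The tier names, task labels, and the `route` function are hypothetical, and a real router would likely decide dynamically per query rather than from a fixed lookup table.

```python
# Hypothetical mapping from task type to model tier, mirroring the task table.
ROUTES = {
    "classification": "small",
    "short_summary": "low_cost",
    "doc_search": "mid_tier_with_rag",
    "code_edit": "high_end",
    "routine_monitoring": "lightweight",
}

# Task types where the table calls for a high-end model plus monitoring.
HIGH_RISK = {"legal", "medical", "finance"}


def route(task_type: str, final_decision: bool = False) -> dict:
    """Pick a model tier and safeguards for a task."""
    high_risk = task_type in HIGH_RISK
    if final_decision or high_risk:
        model = "high_end"
    else:
        # Unknown task types fall back to the high-end tier.
        model = ROUTES.get(task_type, "high_end")
    return {
        "model": model,
        "monitoring": high_risk,
        "human_review": final_decision,
    }


print(route("classification"))
print(route("finance"))
print(route("code_edit", final_decision=True))
```

Defaulting unknown task types to the high-end tier is a conservative choice in this sketch: it trades cost for quality until there is enough measurement to justify routing that task to something cheaper.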
One representative example is Martian.
Accenture announced an investment in Martian in 2024, describing Martian as a company that dynamically routes queries to large language models and delivers more effective AI systems to enterprises. Martian itself describes its model router as a system that dynamically selects the best AI model for each query, optimizing performance, cost, uptime, and other business requirements (Accenture Newsroom: Accenture Invests in Martian to Bring Dynamic Routing of Large Language Queries, Martian: Partners with Accenture, Launches Airlock Compliance for Enterprises).
Just as companies stopped thinking directly about physical servers in the cloud era, in the AI era, instead of humans deciding “which model should we use” each time, a router may end up selecting the optimal model for each task.
Why big AI labs find this hard to take on seriously
What makes this area interesting is that big AI labs are structurally in a hard position to take it on seriously.
OpenAI, Anthropic, Google, and Microsoft want their own models and their own clouds to be used more. But a neutral model router will sometimes decide:
“This task does not need expensive Claude.” “For this process, a cheaper open-source model is enough, instead of GPT.” “Use Llama, not Gemini.” “High-end models are only needed for the final verification.”
This is rational for customer companies. But for big AI labs, it could cut into their own token revenue.
Of course, big AI labs will also push their own internal routing, lightweight models, caching, batching, and inference optimization.
But across companies, the work of neutrally judging “for that task, another company’s model is cheaper and good enough” is easier for a startup to build. That is where AI infrastructure startups have room to enter.
We are moving from an era where the model itself is the value, to an era where the value is in how you select, combine, monitor, and measure cost-effectiveness of models.
AI agent monitoring as another opening
Another important area is the AI agent monitoring layer.
AI agents do not just produce a final answer. They plan, search, call tools, retry on failure, evaluate intermediate results, and finally produce output.
The Tokenmaxxing problem happens along the way.
| Problem | Description |
|---|---|
| Unnecessary model calls | Expensive models used even for small decisions |
| Excessive retries | Failed processes repeated many times |
| Context that is too long | Unnecessary conversation history and logs kept around |
| Wasted tool execution | Repeated unnecessary searches and API calls |
| Disconnected from outcomes | Tokens used, but no outcome produced |
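Two of these patterns, excessive retries and overly long context, can be detected mechanically from an agent's step log. The sketch below is a hypothetical illustration: the trace format, thresholds, and flag wording are assumptions, not the schema of any particular monitoring product.

```python
from collections import Counter

# Hypothetical agent trace: one dict per step.
trace = [
    {"step": "plan", "input_tokens": 1_500, "ok": True},
    {"step": "run_tests", "input_tokens": 40_000, "ok": False},
    {"step": "run_tests", "input_tokens": 55_000, "ok": False},
    {"step": "run_tests", "input_tokens": 70_000, "ok": False},
    {"step": "run_tests", "input_tokens": 85_000, "ok": True},
]

MAX_FAILURES = 2       # assumed threshold for "excessive retries"
MAX_CONTEXT = 50_000   # assumed threshold for "context that is too long"


def find_waste(steps):
    """Return a sorted list of human-readable flags for wasteful patterns."""
    flags = set()
    # Count how many times each step failed before succeeding (or giving up).
    failures = Counter(s["step"] for s in steps if not s["ok"])
    for name, count in failures.items():
        if count > MAX_FAILURES:
            flags.add(f"excessive retries: {name} failed {count} times")
    # Flag any step whose input grew past the context threshold.
    for s in steps:
        if s["input_tokens"] > MAX_CONTEXT:
            flags.add(f"long context in {s['step']}")
    return sorted(flags)


for flag in find_waste(trace):
    print(flag)
```

The remaining rows of the table, such as calls that never lead to an outcome, need the trace to be joined with business results, which a step log alone cannot provide.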
A company worth watching here is Judgment Labs.
In May 2026, Judgment Labs announced that it had raised a combined $32M in seed and Series A, led by Lightspeed Venture Partners. The company is described as building the continuous improvement layer for AI agents (Business Wire: Judgment Labs Closes $32M in Seed and Series A Funding).
This is not just log management.
It is a layer that ties AI agent action logs to each company’s outcome metrics.
For customer support, the outcome metrics may be ticket resolution rate, repeat-contact rate, and customer satisfaction. For code generation, it may be PR acceptance rate, test pass rate, and post-fix bug rate. For sales support, it may be reply rate, opportunity conversion rate, and close rate.
Outcome definitions differ by company.
A monitoring layer like Judgment Labs ties these outcome metrics to AI agent action logs and tries to answer questions like:
At which step did it fail? Where were tokens wasted? Which model calls were unnecessary? Which processes did not lead to outcomes?
Once this becomes visible, companies can measure cost per task for the first time.
In other words, the Tokenmaxxing problem is creating not only model routing, but also AI agent monitoring as a new startup area.
What investors should look at
From an investor perspective, this theme is not just an AI tool cost issue.
It shows that the revenue structure of the AI era is starting to split into three layers.
| Layer | Examples | Revenue opportunity |
|---|---|---|
| Model providers | Anthropic, OpenAI, Google | Rising token consumption leads to rising revenue |
| AI-adopting companies | Uber, Spotify, Shopify, Roblox, etc. | Balancing labor cost reduction against rising inference cost is the challenge |
| AI infrastructure optimization | Martian, Judgment Labs, etc. | Cost management, model selection, monitoring, and evaluation become value |
For Anthropic, the spread of Claude Code is a clear tailwind.
The more Claude Code grows, the more Anthropic’s revenue grows. But for customer companies, AI cost management becomes correspondingly more important.
In other words, Anthropic’s growth may also create demand for AI infrastructure companies like Martian and Judgment Labs.
This is similar to how the growth of AWS, Azure, and Google Cloud in the cloud market led to growth in surrounding layers like Datadog, Snowflake, Cloudflare, and FinOps tools.
The more model companies grow, the more the companies using them need cost management, monitoring, security, evaluation, and routing.
This is where AI startups have room to enter, in a way that big AI labs cannot easily match.
Another view
There is also another way to look at this.
Rising AI cost is not necessarily bad news. It can be read as evidence that AI tools are actually being used.
If, like at Uber, most engineers use AI tools and a large share of code is generated by AI, then it is natural for cost to rise in the short term.
New infrastructure always carries some waste in the early phase.
Cloud was the same at first. You could scale up quickly, but idle instances, excess storage, unnecessary logs, and wasted data transfer cost piled up.
Later, the cloud got FinOps, observability, security, and cost-optimization tools.
The same thing may simply be happening with AI.
In other words, Tokenmaxxing is not an AI adoption failure. It is an operational issue that only became visible because AI moved into production use.
And the fact that the operational issue has become visible means that a market is starting to form for the startups that will solve it.
Summary
Tokenmaxxing is a side effect of the early stage of AI adoption.
I do not see this only as a corporate cost management issue. I see it as a new entry opportunity for AI startups.
Big AI labs like Anthropic and OpenAI are basically built so that their revenue grows as their own model usage grows. For that reason, it is structurally hard for them to build a system that neutrally decides “this task does not need expensive Claude” or “for this process, a cheaper open-source model is enough instead of GPT.”
For companies adopting AI, on the other hand, throwing every task at the most expensive model is not rational.
Small models for simple classification. Low-cost models for short summaries. High-end models for complex code edits. High-end models with monitoring for high-risk areas like legal, medical, and finance.
Selecting the optimal model per task, monitoring agent behavior, and measuring AI cost per outcome — this kind of system becomes necessary.
This is where model routing companies like Martian and AI agent monitoring companies like Judgment Labs have room to enter.
Just as the spread of cloud created Datadog, Cloudflare, FinOps, and security companies, the spread of AI usage should also create neutral infrastructure companies that optimize model usage, not only companies that build models.
The core of Tokenmaxxing is, at the same time, both a corporate AI cost problem and a new market entrance for AI infrastructure startups.