Agent Cost Observability: From Personal Token Budget to Team-Wide
GitHub Copilot moved to usage-based billing today. Premium Request Units are gone; AI Credits replace them; every plan now meters input, output, and cached tokens against a published per-model rate (GitHub Blog, 2026). The 4.7 million paid Copilot subscribers who woke up on a flat-rate plan yesterday inherit a variable per-developer cost line item that did not exist on Friday. That is the news. The deeper shift is that 98% of FinOps practitioners now manage AI spend, up from 31% in 2024 (State of FinOps 2026, 2026). The discipline crossed from “early adopter” to “default” in 24 months. Most engineering leaders are still running ccusage in their head.
Solo monitors solve the personal token budget. They do not survive the team. They cannot say “the payments-team’s autocomplete-rollout is consuming 80% of the daily AI Credit pool right now”; they cannot stop a runaway agent before it crosses the 100% line; and they cannot answer the EU AI Act’s “which agent, which user, which purpose” question that is nine weeks away. This post walks the build recipe for the team tier: three axes of attribution, two tiers of alerting, and the reason the same telemetry happens to be the substrate that finance, audit, and engineering all sit on.
Key Takeaways
- The discipline changed under everyone’s feet. 98% of FinOps practices now manage AI spend, up from 31% in 2024 and 63% in 2025; AI cost management is the #1 forward-looking skillset across organizations of all sizes (State of FinOps 2026, 2026).
- The 200-engineer math is no longer hypothetical. At Anthropic’s published $13 average per active developer day, 200 engineers across 22 active days is $57,200 a month from Claude Code alone (Claude Code costs docs, 2026). The average enterprise sits at $85,521 a month total AI spend; 40% are now over $10M a year (CloudZero, 2026).
- ccusage solved one developer. The 200-developer answer needs three axes of attribution: per-repo via OTel resource attributes, per-team via SSO and SCIM identity, per-feature via PR labels. No single source ships all three.
- Visibility is not control. The $47,000 four-agent loop fired Slack alerts at 50%, 80%, and 95% on the way to its eleventh-day total. Two-tier alerting (soft Slack cap, hard PreToolUse hook) is the difference between watching and stopping.
- Cost telemetry is compliance telemetry. The fields needed to attribute a $4,200 runaway agent are the fields the EU AI Act’s August 2 2026 agentic provisions require. Build cost first; the audit story falls out for free (EU AI Act portal, 2026).
The 200-engineer math
At Anthropic’s published $13 average cost per active Claude Code day, a 200-engineer team across 22 active days is $57,200 a month from one tool (Claude Code costs docs, 2026). CloudZero’s 2025 survey pegged the average enterprise at $85,521 a month in total AI spend (CloudZero State of AI Costs 2025, 2025); the 2026 update found 40% of companies now spending over $10 million a year (CloudZero, 2026). Gartner forecasts $2.52 trillion in worldwide AI spending for 2026, a 44% jump year over year (Gartner, 2026). The line item that did not exist in 2024 is now larger than most teams’ SaaS bill.
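The headline figure is plain multiplication, which makes it easy to rerun with your own headcount. A minimal sketch, assuming every engineer is active on every counted day; the function name is illustrative and not from any vendor tool.

```python
def monthly_tool_cost(engineers: int, active_days: int, cost_per_active_day: float) -> float:
    """Projected monthly spend for one tool, assuming every engineer is active every counted day."""
    return engineers * active_days * cost_per_active_day


# Figures quoted in the text: 200 engineers, 22 active days, $13 per active day.
claude_code_monthly = monthly_tool_cost(200, 22, 13.0)
print(f"${claude_code_monthly:,.0f} per month")  # $57,200 per month
```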
Multi-tool reality complicates the math. One team uses Cursor (64% of the Fortune 500 do, on a per-seat tier with an API-rate top-up since June 2025 (Cursor Enterprise, 2026)); another uses Copilot (4.7 million paid subscribers, today on AI Credits); a third runs Claude Code (Pro and Max plans with weekly rate caps active since August 2025). Each vendor publishes its own analytics surface. None publishes a unified one. The premium request multipliers tell you where the cost lands inside a single product: on Copilot, Claude Opus 4.7 is 15x and Opus 4.6 fast is 30x, while Sonnet 4.6 is 1x and Haiku 4.5 is 0.33x (GitHub Docs, 2026). At constant prompt count, model mix is a 30 to 90x cost spread inside one vendor.
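The 30 to 90x spread follows directly from the quoted multipliers: the most expensive model divided by whichever cheap model you treat as the baseline. A small sketch using only the four multipliers named above; the dictionary and function are illustrative.

```python
# Copilot model multipliers as quoted in the text above.
MULTIPLIERS = {
    "Opus 4.6 fast": 30.0,
    "Claude Opus 4.7": 15.0,
    "Sonnet 4.6": 1.0,
    "Haiku 4.5": 0.33,
}


def cost_spread(multipliers: dict, baseline: str) -> float:
    """Ratio of the most expensive model's multiplier to the baseline model's."""
    return max(multipliers.values()) / multipliers[baseline]


print(f"{cost_spread(MULTIPLIERS, 'Sonnet 4.6'):.1f}x")  # 30.0x
print(f"{cost_spread(MULTIPLIERS, 'Haiku 4.5'):.1f}x")   # 90.9x
```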
The “global cap” answer breaks at 4pm Thursday. When a hard org-level limit fires and 200 engineers freeze, the on-call needs to know whose runaway is consuming 80% of the pool right now, not at end-of-month invoice reconciliation. Attribution is the routing information for incident response. The global cap is the floor; attribution is what lets you raise it from below. If you have not built the per-developer view yet, start with the personal token-tracking pipeline; the team-tier recipe assumes you understand the JSONL shape it scales out of.
Why ccusage stops working at 200
ccusage is the right tool for one developer. It parses ~/.claude/projects/**/*.jsonl from a single home directory and produces daily, monthly, and per-session token plus cost tables; the GitHub repo has 13.9 thousand stars and an MCP server flavor (github.com/ryoppippi/ccusage, 2026). At 200 it has three structural failure modes. You cannot ssh into 200 home directories to harvest JSONL files. The parser knows nothing about identity, repo, or feature; it cannot say which team or which work the spend is attached to. And it is observability-only: the dashboard shows the burn while it is happening, but no enforcement layer ever fires.
The $47,000 four-agent LangChain loop is the canonical illustration. Cost trajectory: week one $127, week two roughly $891 incremental, week three $6,240 incremental, then escalation to $47,000 over eleven days (Tech Startups, 2025). Helicone dashboards displayed the curve. Slack alerts fired at 50%, 80%, and 95% of the OpenAI account-level cap. None of them stopped the agents. The April 2026 $4,200 incident from a CRM-sync agent retrying through HTTP 429s is the smaller-scale companion: $42 in hour one, $200 by hour four, $1,000 by hour twelve, $4,200 by hour 63 (Sattyam Jain, 2026). Watching is not stopping.
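To make "watching is not stopping" concrete, the sketch below replays the four checkpoints of the CRM-sync incident against a hard cap. The $500 cap is a hypothetical number chosen for illustration; only the checkpoints come from the text.

```python
from typing import Optional

# (hour, cumulative USD) checkpoints from the CRM-sync incident quoted above.
TRAJECTORY = [(1, 42), (4, 200), (12, 1000), (63, 4200)]


def first_checkpoint_over(trajectory, cap_usd: float) -> Optional[int]:
    """Return the hour of the first checkpoint at or above cap_usd, or None if never reached."""
    for hour, cumulative in trajectory:
        if cumulative >= cap_usd:
            return hour
    return None


# A hypothetical $500 hard cap is exceeded by the hour-12 checkpoint at the latest,
# 51 hours before the incident reached $4,200.
print(first_checkpoint_over(TRAJECTORY, 500))  # 12
```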
The replacement layer is one of two pipelines. Either a daily cron that rsyncs ~/.claude/projects/ to S3 and queries it via Athena (good for offline diagnosis; bad for real-time alerting); or the Claude Code Analytics API endpoint GET /v1/organizations/usage_report/claude_code, which returns daily per-user, per-model rollups including tokens.input, tokens.output, tokens.cache_read, and estimated_cost.amount in cents USD (Claude Code Analytics API docs, 2026). The API is the cleaner path; aggregation is daily with up to a one-hour delay, so it cannot be the only feed if you need minute-grain control. Anthropic Console workspace-level spend limits add a threshold-email backstop (Anthropic Console Workspaces, 2026); they are necessary but not sufficient, because they are workspace-level only with no per-feature view and no real-time refuse-next-call.
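A rough sketch of the API path, split into building the request and flattening the response so the parsing can be tested offline. The endpoint path and the tokens.* and estimated_cost.amount fields come from the text; the host, the starting_at parameter, the header names, and the data / actor.email_address / model_breakdown nesting are assumptions to check against the current API reference.

```python
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://api.anthropic.com"                   # assumed host
ENDPOINT = "/v1/organizations/usage_report/claude_code"  # path quoted in the text


def build_request(admin_key: str, day: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for one day's rollup."""
    query = urlencode({"starting_at": day})  # assumed parameter name
    return urllib.request.Request(
        f"{BASE_URL}{ENDPOINT}?{query}",
        headers={"x-api-key": admin_key, "anthropic-version": "2023-06-01"},  # assumed headers
        method="GET",
    )


def flatten(payload: dict) -> list:
    """Turn the nested response into one row per (user, model), with cost in USD."""
    rows = []
    for record in payload.get("data", []):  # assumed top-level key
        email = record.get("actor", {}).get("email_address", "unknown")
        for item in record.get("model_breakdown", []):
            tokens = item.get("tokens", {})
            rows.append({
                "email": email,
                "model": item.get("model"),
                "input": tokens.get("input", 0),
                "output": tokens.get("output", 0),
                "cache_read": tokens.get("cache_read", 0),
                # estimated_cost.amount is in cents USD per the text.
                "cost_usd": item.get("estimated_cost", {}).get("amount", 0) / 100,
            })
    return rows
```

Sending the request is one call to urllib.request.urlopen(build_request(key, "2026-06-01")); it is left out here so the sketch runs without credentials.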
The Waxell framing is the load-bearing distinction: visibility is not control (Waxell, 2026). Only 51% of organizations strongly agree they can accurately track AI ROI despite 91% feeling confident overall; only 43% track AI cost by customer; under 22% by transaction (CloudZero, 2026). The confidence-competence gap is the failure mode the team-tier stack closes.
Three axes of attribution
A team-tier cost report needs three axes simultaneously: per-repo, per-team, per-feature. No single source ships all three. Finout’s four-step allocation framework covers per-developer, per-team, per-customer, and shared cost; it excludes per-repo OTel and per-feature PR labels by design (Finout, 2026). Datadog’s Claude Code Monitoring covers per-repo and per-user but not per-feature (Datadog, 2026). Anthropic’s docs cover per-user and a degenerate axis-3 (the single global claude-code-assisted PR label). The combination is unowned. Each axis has its own primitive.
Per-repo via OpenTelemetry resource attributes. Wrap the Claude Code launcher: OTEL_RESOURCE_ATTRIBUTES=service.namespace=$(git -C . remote get-url origin | xargs basename) CLAUDE_CODE_ENABLE_TELEMETRY=1 claude. Every span the session emits now carries service.namespace. The OpenTelemetry GenAI semantic conventions ship gen_ai.client.token.usage as a Histogram with required attributes gen_ai.operation.name, gen_ai.provider.name, and gen_ai.token.type; Datadog adopted them natively in v1.37 (OpenTelemetry GenAI metrics, 2026; Datadog, 2026). One caveat: gen_ai.usage.input_tokens includes cached tokens by spec, while cache_read.input_tokens and cache_creation.input_tokens are subsets; homegrown dashboards that double-count cached tokens are common. Read the spec before you sum.
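The one-line shell wrapper can also be written as a small launcher, which makes the repo-name parsing testable and strips the trailing .git that basename leaves behind. A minimal sketch: the two environment variables are the ones named above, while the function names and the choice to append to any existing OTEL_RESOURCE_ATTRIBUTES are assumptions.

```python
import os
import subprocess


def repo_from_remote(url: str) -> str:
    """Extract the repository name from an HTTPS or SSH git remote URL."""
    name = url.strip().rstrip("/").rsplit("/", 1)[-1]
    # SSH remotes such as git@host:repo.git have no slash before the repo name.
    name = name.rsplit(":", 1)[-1]
    return name[:-4] if name.endswith(".git") else name


def telemetry_env(repo: str, base_env) -> dict:
    """Copy base_env, enable telemetry, and tag the session with service.namespace."""
    env = dict(base_env)
    env["CLAUDE_CODE_ENABLE_TELEMETRY"] = "1"
    attr = f"service.namespace={repo}"
    existing = env.get("OTEL_RESOURCE_ATTRIBUTES")
    # Keep attributes set elsewhere rather than overwriting them.
    env["OTEL_RESOURCE_ATTRIBUTES"] = f"{existing},{attr}" if existing else attr
    return env


def launch() -> None:
    """Resolve the origin remote, then replace this process with claude."""
    url = subprocess.run(
        ["git", "remote", "get-url", "origin"],
        capture_output=True, text=True, check=True,
    ).stdout
    os.execvpe("claude", ["claude"], telemetry_env(repo_from_remote(url), os.environ))
```

Installed as a script that calls launch(), this would stand in for typing claude directly, so attribution does not depend on anyone remembering to set the variables.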
Per-team via SSO and SCIM identity. Pull the daily Claude Code Analytics API rollup, JOIN against Okta or Azure AD groups by email. Result: per-team, per-day, per-model, per-token-type table. The same shape works for GitHub Copilot’s pooled AI Credits and for Cursor’s enterprise admin API, which exposes SAML SSO, SCIM provisioning, and team-level usage limits (Cursor Enterprise, 2026). No vendor publishes this JOIN; it is a 30-line script. The reason this axis matters even when per-user data exists: at 200 engineers, “user joe@example.com spent $1,200 last week” is not a useful escalation. “The payments team spent $14,000 last week, mostly on autocomplete-rollout, mostly on Opus” is.
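The JOIN itself is small. The sketch below assumes the usage rows are already flattened to email, model, and cost, and that the identity provider export has been reduced to an email-to-team mapping; both shapes are illustrative, not a vendor format.

```python
from collections import defaultdict


def rollup_by_team(usage_rows, team_by_email) -> dict:
    """Sum cost per (team, model); users missing from the directory land in 'unassigned'."""
    totals = defaultdict(float)
    for row in usage_rows:
        team = team_by_email.get(row["email"].lower(), "unassigned")
        totals[(team, row["model"])] += row["cost_usd"]
    return dict(totals)


# Illustrative data only.
usage = [
    {"email": "joe@example.com", "model": "opus", "cost_usd": 700.0},
    {"email": "Ana@example.com", "model": "opus", "cost_usd": 500.0},
    {"email": "sam@example.com", "model": "sonnet", "cost_usd": 40.0},
]
teams = {"joe@example.com": "payments", "ana@example.com": "payments"}

print(rollup_by_team(usage, teams))
# {('payments', 'opus'): 1200.0, ('unassigned', 'sonnet'): 40.0}
```

Keeping an explicit "unassigned" bucket is a deliberate choice: spend from accounts that SCIM has not provisioned should be visible rather than silently dropped.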
Per-feature via PR labels. A lightweight GitHub Action that emits a feature/<slug> label on each PR (driven by branch name, JIRA ticket prefix, or a Feature: line in the PR body). A downstream cost-attribution job correlates session start and end timestamps to PR open and close timestamps. Imperfect, but the only attribution axis that survives a cross-team feature like a checkout rewrite that touches three repos and four squads. The axis falls cleanly out of the spawn-versus-stay subagent pattern: if subagents fan out to multiple repos for one feature, the per-repo and per-team views split the spend across siloes; the per-feature view is what reattaches it.
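Both halves of this axis, deriving the label and correlating sessions to PRs, fit in a short sketch. The feature/ branch prefix, the record shapes, and the rule that the first overlapping PR in the same repo wins are all simplifying assumptions; a production job would need a tie-break for concurrent PRs.

```python
import re
from datetime import datetime
from typing import Optional

FEATURE_LINE = re.compile(r"^Feature:[ \t]*(\S+)", re.MULTILINE | re.IGNORECASE)


def feature_label(branch: str, pr_body: str = "") -> Optional[str]:
    """Derive feature/<slug> from a 'Feature:' line in the PR body, else from a feature/ branch."""
    match = FEATURE_LINE.search(pr_body)
    if match:
        slug = match.group(1)
    elif branch.startswith("feature/"):
        slug = branch.split("/", 1)[1]
    else:
        return None
    slug = re.sub(r"[^a-z0-9]+", "-", slug.lower()).strip("-")
    return f"feature/{slug}" if slug else None


def overlaps(session: dict, pr: dict) -> bool:
    """True when the session's time window intersects the PR's open-to-close window."""
    return session["start"] <= pr["closed"] and session["end"] >= pr["opened"]


def attribute_sessions(sessions, prs) -> dict:
    """Assign each session's cost to the first overlapping PR label in the same repo."""
    totals = {}
    for s in sessions:
        label = next(
            (p["label"] for p in prs if p["repo"] == s["repo"] and overlaps(s, p)),
            "unattributed",
        )
        totals[label] = totals.get(label, 0.0) + s["cost_usd"]
    return totals


print(feature_label("feature/Checkout_Rewrite"))  # feature/checkout-rewrite
```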
Two-tier alerting: soft cap, hard cap
Visibility-only dashboards do not stop runaway agents. The $47,000 LangChain loop fired Slack alerts at 50%, 80%, and 95% of its OpenAI account-level cap; none of them was enforcement. The two-tier pattern is borrowed from payments rate-limiting (Portkey documents the canonical shape at budget limits and alerts, 2026) and translated to Claude Code’s substrate: a soft Slack webhook fires at roughly 70% of the team’s daily AI Credit pool (notification only); a hard PreToolUse hook returns {"decision":"block","reason":"team budget exhausted"} at 100% (deterministic, runs every time).
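The two thresholds reduce to a three-way classification plus a message builder for the soft tier. A minimal sketch: the 70% and 100% ratios come from the text, the {"text": ...} body is the standard Slack incoming-webhook shape, and the function names and breakdown format are illustrative.

```python
SOFT_RATIO = 0.70  # soft cap: roughly 70% of the team's daily pool (notification only)
HARD_RATIO = 1.00  # hard cap: 100% of the pool (enforcement)


def budget_tier(spent: float, daily_pool: float) -> str:
    """Classify current spend as 'ok', 'soft', or 'hard' against the daily pool."""
    if daily_pool <= 0:
        raise ValueError("daily_pool must be positive")
    ratio = spent / daily_pool
    if ratio >= HARD_RATIO:
        return "hard"
    if ratio >= SOFT_RATIO:
        return "soft"
    return "ok"


def slack_message(team: str, spent: float, daily_pool: float, by_feature: dict) -> dict:
    """Build a Slack incoming-webhook payload with the per-feature breakdown, largest first."""
    pct = spent / daily_pool * 100
    lines = [f"{team} is at {pct:.0f}% of its daily AI Credit pool"]
    for feature, cost in sorted(by_feature.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"- {feature}: {cost:,.2f}")
    return {"text": "\n".join(lines)}


print(budget_tier(69, 100), budget_tier(70, 100), budget_tier(100, 100))  # ok soft hard
```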
The shape that motivates the pattern is concrete. In the $47,000 loop, each escalation crossed an alert threshold that nobody acted on. The soft cap is the human-in-the-loop signal: a Slack message that includes the per-team and per-feature breakdown so on-call routes to the team that just crossed, not to a generic #alerts channel. The hard cap is the deterministic floor. Claude Code hooks are not advisory; they run as code on every matched event regardless of what the model wanted to do, which is why hooks are the only deterministic substrate the agent stack offers. Returning block from a PreToolUse hook is the only place in the call chain where “stop spending” is a guarantee instead of a hope.
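A hard-cap hook can be a few lines of Python. The sketch below uses the block payload quoted in the text and assumes the hook receives its event as JSON on stdin and replies with JSON on stdout; read_totals is a hypothetical callback standing in for wherever the rolling daily total lives. Check the payload shape against the current hooks reference before relying on it.

```python
import json
import sys

# Payload shape as quoted in the text above.
BLOCK = {"decision": "block", "reason": "team budget exhausted"}


def hook_response(spent_today: float, daily_pool: float):
    """Return the block payload once spend reaches the pool, else None (allow the tool call)."""
    if spent_today >= daily_pool:
        return BLOCK
    return None


def main(read_totals, stdin=sys.stdin, stdout=sys.stdout) -> int:
    """Read the hook event, look up the team's totals, and emit a block decision if needed."""
    event = json.load(stdin)          # hook event from Claude Code
    spent, pool = read_totals(event)  # hypothetical lookup: (spent_today, daily_pool)
    response = hook_response(spent, pool)
    if response is not None:
        json.dump(response, stdout)
    return 0
```

Wired up as a real hook, the script would end with sys.exit(main(your_lookup)). One design decision to make explicitly is what happens when the lookup itself fails: this sketch would raise, and whether that should fail open or fail closed is a policy question rather than a coding one.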
Anthropic Console workspace spend limits add a third tier as the quiet backstop: an org-level dollar cap that fires email at threshold and refuses requests beyond. Three tiers of defence (Slack soft, hook hard, console backstop) is the right number. Two of them are inside your control plane; one is inside the vendor’s. Plug the same shape into Copilot today by reading the preview-bill API, projecting AI Credit consumption at the team level, and applying tier-down rules at the repo level when the projection crosses 80% of the monthly pool. The shape works for any vendor whose surface emits enough telemetry to budget against; vendor identity does not change the recipe.
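The Copilot projection can be sketched as a linear run-rate extrapolation. The 80% threshold comes from the text; the linear model is an assumption (real usage is rarely flat across a month), and the inputs are whatever month-to-date numbers your billing feed provides.

```python
def projected_month_end(credits_used: float, day_of_month: int, days_in_month: int) -> float:
    """Linear run-rate projection of month-end AI Credit consumption."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must fall within the month")
    return credits_used / day_of_month * days_in_month


def should_tier_down(credits_used: float, day_of_month: int, days_in_month: int,
                     monthly_pool: float, threshold: float = 0.80) -> bool:
    """True when the projection crosses the threshold share of the monthly pool."""
    projection = projected_month_end(credits_used, day_of_month, days_in_month)
    return projection > threshold * monthly_pool


# 3,000 credits used by day 10 of a 30-day month projects to 9,000,
# which is over 80% of a 10,000-credit pool.
print(should_tier_down(3000, 10, 30, 10000))  # True
```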
The cost trajectory is also a tell on the demand-side optimization story: once the hard cap is in place, the question shifts from “how do we stop runaway burn?” to “how do we make our intentional burn cheaper?” That is a different post. The hard cap is the prerequisite. You cannot optimize what you cannot bound.
Is cost observability the same as compliance telemetry?
The same span fields needed to attribute a $4,200 runaway agent (model, prompt purpose, user, parent task, token count) are the fields the EU AI Act’s August 2 2026 agentic-AI provisions require: per-endpoint logs of data transmitted, purpose of transmission, sensitivity classification, continuous monitoring, full data lineage (EU AI Act portal, 2026; Raconteur audit guide, 2026). An organization that builds three-axis attribution to satisfy FinOps for AI today has, by accident, satisfied roughly 70% of its agentic-AI compliance posture for August. Most readers will frame this as a finance story. The real story is that cost telemetry and audit telemetry are the same telemetry.
The convergence is not coincidence. SOC 2 auditors began asking “where AI sits in your data environment and whether it touches customer data directly” through 2025 and 2026; per-request cost visibility, tagging, and alerts are now cited as enterprise expectations even though the standard has not yet added a dedicated AI criterion (Konfirmity SOC 2 changes, 2026). ISO/IEC 42001:2023 already requires lifecycle risk management, accountability, and audit trails for AI Management Systems (ISO 42001, 2023). The OpenTelemetry GenAI semantic conventions (still in Development status as of v1.36) are the wire format that makes the same telemetry portable across every stack (OpenTelemetry, 2026). Three forces converge on the same span fields. The substrate is shared.
The practical implication is sequencing. Start with cost attribution. The compliance fields (purpose, data class, lineage) fall out for free as your span shape stabilises. The site’s earlier piece on treating AI as a team member made the same argument from the other side: design the substrate first, the policies follow.
The cost of the alternative is concrete. A ccusage-for-team that has no concept of purpose, user identity, or data lineage cannot answer the auditor’s question, and the work has to be redone twice. Solo tools optimize for “did I exceed my plan.” The team-and-compliance problem requires “which user, which agent, which model, which task, which data class, for what business purpose, how much money.” There is no shortcut that gives you the second view if you only built the first.
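One way to see the gap between the two views is to check a span against both field lists. The field names below are hypothetical spellings of the attributes the text lists; they are not taken from the OpenTelemetry conventions or from any regulation.

```python
# Fields the text says are needed to attribute a runaway agent (names are illustrative).
COST_FIELDS = {"model", "user", "parent_task", "token_count", "prompt_purpose"}

# The fuller "which user, which agent, which model, which task, which data class,
# for what business purpose, how much money" view.
AUDIT_FIELDS = COST_FIELDS | {"agent", "data_class", "cost_usd"}


def missing_fields(span: dict, required: set) -> list:
    """Return the required fields that are absent or empty on a span, sorted for stable output."""
    return sorted(f for f in required if span.get(f) in (None, ""))


solo_span = {"model": "opus", "token_count": 1200, "cost_usd": 0.42}
print(missing_fields(solo_span, AUDIT_FIELDS))
# ['agent', 'data_class', 'parent_task', 'prompt_purpose', 'user']
```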
What does a team-tier reference stack look like?
Putting the recipe together. A reference team-tier stack has six pieces, deploys in a week of focused work, and survives the multi-vendor reality.
- Per-repo emission. A Claude Code launcher wrapper that sets OTEL_RESOURCE_ATTRIBUTES=service.namespace=<repo-from-git> and CLAUDE_CODE_ENABLE_TELEMETRY=1 before exec’ing claude. Spans now carry service.namespace natively. Five Claude Code OTel event types ship out of the box: api_request (model, cost, token counts, latency), tool_result (tool name, MCP server, success or failure, duration), and three off-by-default event types (OTEL_LOG_USER_PROMPTS=0, OTEL_LOG_TOOL_DETAILS=0); leave the off-by-default ones off (Claude Code Monitoring docs, 2026).
- Per-team rollup. A daily cron pulling /v1/organizations/usage_report/claude_code and joining against Okta groups by email. Emits a per-team, per-day, per-model, per-token-type table. Repeat the JOIN shape for Cursor and for Copilot AI Credits. Three feeds; one schema.
- Per-feature label and correlate. A pr-feature-label GitHub Action that derives feature/<slug> from branch name or PR body. A nightly correlation job that maps Claude Code session start and end timestamps to open and merge timestamps of PRs touching the same files.
- Soft cap. A Slack webhook fired at 70% of the team’s daily AI Credit pool. Message body includes per-feature breakdown so on-call routes to the team. The webhook is signal, not enforcement.
- Hard cap. A PreToolUse Claude Code hook that consults the rolling daily total and returns {"decision":"block","reason":"team budget exhausted"} at 100%. The hook is enforcement, not signal.
- Console backstop. Anthropic Console workspace spend limits set at the org’s monthly ceiling, fires email at threshold, refuses on cap. The vendor-side floor.
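"Three feeds; one schema" is easier to hold to if the schema is written down before the second feed arrives. A possible row shape, assuming one record per day, vendor, and attribution tuple; every field name here is illustrative.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CostRow:
    """One normalized cost record, shared by every vendor feed."""
    day: date
    vendor: str      # e.g. "claude_code", "cursor", "copilot"
    team: str        # from the identity-provider JOIN
    repo: str        # from service.namespace
    feature: str     # from the PR label, or "unattributed"
    model: str
    token_type: str  # e.g. "input", "output", "cache_read"
    tokens: int
    cost_usd: float


def total_by(rows, axis: str) -> dict:
    """Sum cost along any one attribution axis, such as 'team', 'repo', or 'feature'."""
    totals = {}
    for row in rows:
        key = getattr(row, axis)
        totals[key] = totals.get(key, 0.0) + row.cost_usd
    return totals
```

Because every axis is a column on the same row, the per-repo, per-team, and per-feature reports become three calls to total_by over one table rather than three separate pipelines.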
Explicit non-features are as important as features. There is no per-keystroke logging. There is no prompt-content capture by default. There is no shadow attribution that engineers cannot see; the team-level dashboard is open to the engineers whose spend it tracks, because attribution-as-policing fails the cultural test the same way time-tracking does. v1 is read-only across the board; v2 adds compliance-extracted fields like gen_ai.prompt.purpose and data-class tags only when the auditor asks for them.
The org-chart implication is real. This stack lives in platform-eng or DevEx, not in finance. The 78% of FinOps teams now reporting into the CTO or CIO is the political tailwind: the finance partner is asking for these numbers, and the engineering leader’s incentive is to ship the recipe rather than answer monthly spreadsheets (State of FinOps 2026, 2026).
When you should not build this yet
Not every team needs the team-tier stack. Building three-axis attribution before you have the spend or the incident is engineering theatre, and theatre costs engineer-weeks. The bar to clear is one of two thresholds: monthly AI spend exceeds the cost of one engineer-week per month (roughly $5,000 to $8,000 loaded), or the team has had at least one runaway-agent near-miss in the prior quarter.
The four-question filter does the work.
- Do you have $5K-plus a month in AI spend across all assistants? Below the threshold, ccusage per developer plus a workspace-level spend limit is fine.
- Have you had a runaway-agent incident or near-miss this quarter? Even a single near-miss promotes the need; runaway events are the unmistakable signal that a hard cap is required.
- Do you have multiple AI tools or vendors? The unified attribution view earns its complexity exactly when one team uses Cursor, another Copilot, a third Claude Code. With one vendor, the vendor’s own analytics may be enough.
- Are EU AI Act-scope deliverables on your roadmap before August 2 2026? The compliance forcing function changes the math; once “audit trail of agentic AI use” is a deliverable, the cost-observability work is no longer optional.
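The four questions can be written down as a checklist function. This sketch takes one possible reading, that each "yes" is an independent reason to build, and returns the reasons rather than a verdict; the $5,000 default is the lower bound quoted above, and the function name is made up.

```python
def build_triggers(monthly_spend_usd: float, near_misses_this_quarter: int,
                   vendor_count: int, eu_ai_act_in_scope: bool,
                   spend_threshold_usd: float = 5000.0) -> list:
    """Return the reasons, if any, that argue for building the team-tier stack now."""
    reasons = []
    if monthly_spend_usd >= spend_threshold_usd:
        reasons.append("monthly AI spend is at or above the threshold")
    if near_misses_this_quarter >= 1:
        reasons.append("runaway-agent incident or near-miss this quarter")
    if vendor_count > 1:
        reasons.append("multiple AI tools or vendors in use")
    if eu_ai_act_in_scope:
        reasons.append("EU AI Act deliverables on the roadmap")
    return reasons


# A small single-vendor team with no incidents gets an empty list:
# ccusage per developer plus a workspace spend limit is fine for now.
print(build_triggers(1200.0, 0, 1, False))  # []
```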
The cost of premature build is concrete. Engineer-weeks; integration debt with vendor APIs that change quarterly; over-attribution surfacing political fights about whose team is “spending too much” before the spend justifies the conversation. The right minimal entry for many teams is ccusage per developer plus a workspace spend cap plus a manual cost review at month-end. The trigger to build the team-tier stack is the first runaway-agent near-miss. That signal is unmistakable. Until it arrives, optimize for being able to read the dashboard, not for owning every axis of attribution.
Frequently Asked Questions
GitHub Copilot moved to usage-based billing today. What do I instrument first?
Three things, deployed today. Wire the preview-bill API into a daily report so AI Credit consumption is visible per repo and per team. Deploy a pr-feature-label GitHub Action so cost-per-feature works from session one rather than retrofitted next quarter. Stand up a side-by-side claude-code-assisted and copilot-assisted PR labeler so the cost comparison from day one is honest. Pricing is Pro $10 a month with $10 AI Credits, Pro+ $39 with $39, Business $19 a seat with $19 ($30 promo June through August), Enterprise $39 a seat with $39 ($70 promo June through August) (GitHub Blog, 2026).
Can I just centrally collect ccusage’s JSONL files?
No. ccusage parses ~/.claude/projects/**/*.jsonl from one home directory; centralising the files solves the harvesting problem but not the identity problem. The parser knows nothing about teams, repos, features, or org policy. At team scale you need either a daily JSONL cron-to-S3 pipeline plus a JOIN against your identity provider, or the Claude Code Analytics API at /v1/organizations/usage_report/claude_code, which returns daily per-user rollups already keyed on email (Claude Code Analytics API docs, 2026).
Does OpenTelemetry GenAI semconv work for Claude Code?
Yes, opt-in. Set CLAUDE_CODE_ENABLE_TELEMETRY=1 and a standard OTLP exporter; Claude Code emits five OTel event types including api_request with model and cost and token counts and tool_result with tool name and duration, exported on a 60-second metrics interval and a 5-second logs interval (Claude Code Monitoring docs, 2026). Datadog supports the GenAI semconv natively as of v1.37 (Datadog, 2026). Caveat: the spec is still in Development status; expect schema churn before stability lands.
Where does cost observability stop and EU AI Act compliance start?
Mostly the same telemetry. The fields needed to attribute a runaway agent (model, user, parent task, token count, prompt purpose) overlap heavily with EU AI Act August 2 2026 agentic-AI logging requirements (per-endpoint data, purpose, sensitivity, continuous monitoring). Build cost first; the compliance fields fall out as your span shape stabilises (EU AI Act portal, 2026). The work is not redone; it is annotated.
What now?
Three takeaways for the build conversation.
- The discipline shifted under everyone’s feet. 31% of FinOps practices managed AI spend in 2024. 98% do today. The “I just track my own ccusage” stance is one inflection point behind the discipline.
- Three axes plus two tiers. Per-repo via OTel resource attributes, per-team via SSO and SCIM identity, per-feature via PR labels; soft Slack cap, hard PreToolUse hook. The recipe survives multi-tool fleets and runaway-agent incidents in a way ccusage and a global cap alone cannot.
- Cost telemetry is compliance telemetry. The same span fields satisfy FinOps for AI today and the EU AI Act in nine weeks. Build it once.
Today is June 1. Do the four-question pass on your team’s AI spend posture this week. What is your monthly AI spend across all assistants? What is your runaway-agent near-miss count this quarter? What is your hard-cap implementation today? What auditor question are you nine weeks away from? If the answers are “I am not sure,” “zero we know about,” “Slack alerts,” and “the EU AI Act August 2 deadline,” the Copilot flip already happened and the August deadline is closer than it looks. The schema arithmetic is patient. The bill is not.