
MCP Server for Your Codebase: Tool-Shape, Not API-Mirror

24 min read

An editorial illustration contrasting a wall of identical generic boxes labelled API mirror with a smaller set of differently-shaped purpose-built tools labelled tool shape, used as a metaphor for two opposing approaches to MCP server design

Cloudflare’s first cut at an MCP server for its public API would have eaten 1.17 million input tokens before the agent processed a single user message. Their redesign got it to roughly 1,000 (Cloudflare, 2026). The lesson is not about Cloudflare. It is about the default shape of every MCP server tutorial you have ever read.

Every tutorial teaches you to wrap an API endpoint per tool. That pattern works for five tools and breaks at fifty. For a codebase, where the right unit of retrieval is a semantic symbol rather than a file, it breaks earlier and worse. This post walks the design framework that holds: tool shape, freshness, OAuth 2.1, the failure modes you have to instrument, and the four-question filter that decides whether the server should exist at all.

Key Takeaways

  • The token cliff is mechanical. Cloudflare’s 2,500-endpoint API would have consumed 1.17M tokens as a one-tool-per-endpoint MCP; two generalised tools collapsed it to roughly 1,000 (Cloudflare, 2026). Maxim AI corroborated independently at 508 tools across 16 servers: 75.1M flat-list tokens vs 5.4M code-mode, with pass rate held at 100% (Maxim AI, 2026).
  • Tool-shape, not API-mirror. Microsoft’s Learn MCP exposes three tools and states the principle plainly: “design tools for the agent workflow, not to mirror internal APIs” (Microsoft Engineering, 2026). GitHub cut its Copilot toolset from 40 to 13, gaining 2 to 5 pp on SWE-Lancer and cutting average latency by 400 ms (GitHub, 2026).
  • A codebase server is a search engine, not an API gateway. The design work is symbol indexing, hybrid retrieval, and freshness invalidation. On a 1,000-round in-house benchmark, symbol-level retrieval beat file-level retrieval, scoring 9.93 of 10 on relevance with 10x fewer tokens.
  • Auth and transport are not optional. The MCP spec 2025-11-25 made OAuth 2.1 plus PKCE mandatory for HTTP transport, and a February 2026 scan of 1,400 remote servers found 93% on Streamable HTTP (bloomberry, 2026). The Zuplo State of MCP Report still found 25% of servers ship with no auth at all (Zuplo, 2025).
  • Three failure modes you must instrument by design. Token blowup (Apideck measured 28% silent call failure on the GitHub MCP under realistic load), retrieval poisoning (Invariant Labs hit 72.8% attack success on o1-mini), and the April 2026 STDIO RCE class (up to 200,000 servers at risk; Anthropic declined to patch).
  • The hardest design question is whether to ship at all. If the server is a slightly better grep, it should not exist. The four-question filter (does Claude Code’s filesystem MCP cover it; is a CLI cheaper; do you need symbol-level reasoning; will enough downstream agents amortise the build) decides.

Where 1.17 million tokens come from

When you mirror an API one endpoint per tool, schema overhead grows linearly with surface area. Cloudflare’s 2,500-endpoint API would have consumed roughly 1.17 million input tokens before the model saw a user message (Cloudflare, 2026). Two generalised tools (search() and execute() against a sandboxed V8 isolate) collapsed it to about 1,000 tokens. The reduction is 99.9%. The technique is not magic. It is arithmetic.

The cliff is mechanical, not a model-quality issue. You do not get to argue with the schema arithmetic. Maxim AI ran the same query set against flat-list and code-mode configurations at 508 tools across 16 servers and found 75.1 million input tokens vs 5.4 million (Maxim AI, 2026). That is a 92.8% reduction with pass rate held at 100%. Two independent benchmarks; the same direction; the same magnitude.
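The arithmetic is easy to check by hand. The sketch below uses only the figures quoted above; the per-endpoint average is derived from those totals rather than published by Cloudflare, and the function name is illustrative.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` tokens to `after` tokens."""
    return (1 - after / before) * 100


# Cloudflare: one tool per endpoint versus two generalised tools.
cloudflare_before = 1_170_000
cloudflare_after = 1_000
endpoints = 2_500

# Average schema cost per mirrored endpoint (inferred from the quoted totals).
tokens_per_endpoint = cloudflare_before / endpoints

# Maxim AI: flat tool list versus code mode at 508 tools.
maxim_before = 75_100_000
maxim_after = 5_400_000

print(f"per-endpoint schema cost: {tokens_per_endpoint:.0f} tokens")
print(f"Cloudflare reduction: {reduction_pct(cloudflare_before, cloudflare_after):.1f}%")
print(f"Maxim AI reduction: {reduction_pct(maxim_before, maxim_after):.1f}%")
```

At roughly 468 tokens of schema per endpoint, the total is a straight multiplication by surface area, which is why trimming descriptions alone cannot rescue a one-tool-per-endpoint design.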

StackOne measured the GitHub MCP server at 94 tools and 17,600 tokens of schema overhead per call, max-compressible to roughly 500 tokens at the cost of some descriptive fidelity (StackOne, 2026). At the academic upper bound, the MCPVerse benchmark assembles 552 real-world tools whose combined schemas reach roughly 147,000 tokens, exceeding the input window of DeepSeek-V3 (64K), GPT-4o (128K), and Qwen3-235B (128K) (arXiv 2508.16260, 2025). At that surface, the failure mode is not “the agent picks the wrong tool”; it is “the tools cannot fit in context at all.”

GitHub published the cliff effect inside its own product. When the Copilot team cut the default toolset from 40 to 13, they measured 2 to 5 percentage points of gain on SWE-Lancer, 400 ms shaved off average latency, and 190 ms off time-to-first-token (GitHub, 2026). Performance did not degrade gradually past about 20 tools; it fell off a ledge. The same overhead discipline that drives skills progressive disclosure at the model layer is what’s at stake here, one layer down. The schema is the prompt, and the prompt has a tax.

Tool-shape: design by the question, not the endpoint

Microsoft’s Learn MCP server exposes three tools total: microsoft_docs_search, microsoft_docs_fetch, microsoft_code_sample_search. The team’s stated principle is the cleanest one-liner in the corpus: “design tools for the agent workflow, not to mirror internal APIs” (Microsoft Engineering, 2026). GitHub reduced its default Copilot toolset from 40 to 13. Block rebuilt its Linear MCP server three times: more than 30 tools, fewer, finally two.

Tool-shape is design vocabulary. Each tool is a complete answer to a question the agent asks. It is not a method on a class, and it is not a route on a router. Four design questions do the work:

  1. What does the agent actually need to know?
  2. What is the smallest surface that returns it?
  3. What budget does the call consume in tokens?
  4. What does failure look like, and is the failure observable?

For a codebase, the answers translate into named operations: find_symbol(name, kind, scope) answers “where is this defined and what version of it”; trace_callers(symbol, depth, budget) answers “who depends on this”; find_references(symbol, kind=read|write) answers “where is it used, separately for reads and writes”; impact(change_set) answers “what tests should I run if I change these files.” None of those is a read_file mirror. Each is a question the agent would have to assemble from text search, multiple times, in a loop. The tool collapses the loop.
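To make "the tool collapses the loop" concrete, here is a minimal sketch of trace_callers over a toy in-memory call graph. The graph contents, the CallerHit type, and the choice to count budget in results rather than output tokens are all assumptions made for illustration; a real server would back this with its symbol index.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical toy call graph: callee -> direct callers.
CALLERS = {
    "parse_config": ["load_app", "run_tests"],
    "load_app": ["main"],
    "run_tests": ["ci_entry"],
    "main": [],
    "ci_entry": [],
}


@dataclass
class CallerHit:
    symbol: str
    depth: int  # 1 = direct caller


def trace_callers(symbol: str, depth: int, budget: int) -> list[CallerHit]:
    """Breadth-first walk up the call graph, capped by depth and result budget."""
    hits: list[CallerHit] = []
    seen = {symbol}
    queue = deque([(symbol, 0)])
    while queue and len(hits) < budget:
        current, d = queue.popleft()
        if d == depth:
            continue  # do not expand beyond the requested depth
        for caller in CALLERS.get(current, []):
            if caller in seen:
                continue
            seen.add(caller)
            hits.append(CallerHit(caller, d + 1))
            if len(hits) == budget:
                break
            queue.append((caller, d + 1))
    return hits


for hit in trace_callers("parse_config", depth=2, budget=10):
    print(hit.depth, hit.symbol)
```

One call returns the transitive callers with their distance, which the agent would otherwise reconstruct through repeated text searches.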

The obvious tension is worth naming up front. The site’s existing post on what a symbol-aware codebase server actually looks like documents a 32-tool MCP server. Thirty-two is more than thirteen. The relevant comparison is not 32 vs 13 in the abstract; it is 32 purposeful tools with precise descriptions vs 30 endpoint-mirroring tools with generic descriptions. The first is a tool-shape design that earned its surface area; the second is the API-mirror anti-pattern in disguise. Microsoft’s three tools and a 32-tool symbol graph are both tool-shape designs. The number is downstream of the discipline.

A codebase server is a search engine, not an API gateway

API gateway vocabulary (endpoints, routes, REST, methods) is the wrong mental model for a codebase. Search engine vocabulary (index, query, ranking, freshness) is the right one. A codebase server’s first design decision is the index, not the endpoint list. On a 1,000-round in-house benchmark, symbol-level retrieval beat file-level retrieval, scoring 9.93 of 10 on relevance with 10x fewer tokens and 2.1x fewer tool calls. No SERP competitor has run this number for a codebase MCP server.

Library card catalogue close-up showing rows of indexed wooden drawers, used as a metaphor for the symbol index that sits at the heart of a well-designed codebase MCP server

Index as architecture. The right primitives are a symbol graph, a type graph, a call graph, and a dependency graph. Not a flat file list. Tree-sitter is the practical AST parser; a persistent knowledge graph keyed on symbol identity rather than file path is the practical store. The arXiv MCPVerse benchmark and the open-source codebase-memory-mcp implementation both converge on this design from independent directions (arXiv 2508.16260, 2025). When two unrelated efforts pick the same shape, that shape is not opinion.

Query execution is hybrid retrieval. BM25 for lexical matches plus a vector index for semantic ones, then a cross-encoder rerank when the call budget allows. None of this is novel ML; it is standard search infrastructure that senior infra engineers already know how to operate. The move is to recognise that a codebase MCP server is search infra wearing a tool-call API, and to bring the search-infra discipline (sharding, recall-vs-precision tradeoffs, query plans) into the design conversation.
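The post does not name a method for merging the lexical and vector result lists. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method any cited server uses. The symbol IDs are hypothetical, and the cross-encoder rerank step is omitted.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs into one list.

    Each ID scores 1 / (k + rank) per list it appears in; scores are summed.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a BM25 query and a vector query for the same request.
bm25_hits = ["auth.login", "auth.logout", "db.connect"]
vector_hits = ["auth.session", "auth.login", "db.connect"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Symbols that both retrievers agree on rise to the top, and no score normalisation between BM25 and cosine similarity is needed, which is the main reason RRF is a convenient default.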

Ranking signals are where codebase servers earn their keep. Recency from commit timestamp; centrality from caller count; blame heat from recent change frequency; testedness from coverage data. Each signal answers a question the agent would otherwise have to ask in series: which version of this is current, which paths matter most, which paths are stable, which paths are risky. None of those signals appears in any current MCP tutorial; all four are well-understood in classical search ranking. Bringing them across is the move.
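A minimal sketch of how the four signals could combine into a single score. The weights, the squashing functions, and the field names are all hypothetical; in particular, whether high blame heat should raise or lower a result depends on the query, and this sketch simply treats it as a small positive feature.

```python
from dataclasses import dataclass


@dataclass
class SymbolSignals:
    days_since_commit: float  # recency input
    caller_count: int         # centrality input
    changes_last_90d: int     # blame-heat input
    coverage: float           # testedness input, 0.0 to 1.0


# Hypothetical weights; tune against your own relevance judgments.
WEIGHTS = {"recency": 0.3, "centrality": 0.4, "blame_heat": 0.1, "testedness": 0.2}


def features(s: SymbolSignals) -> dict[str, float]:
    """Squash each raw signal into the 0 to 1 range so the weights are comparable."""
    return {
        "recency": 1.0 / (1.0 + s.days_since_commit / 30.0),
        "centrality": s.caller_count / (s.caller_count + 10.0),
        "blame_heat": s.changes_last_90d / (s.changes_last_90d + 5.0),
        "testedness": min(1.0, max(0.0, s.coverage)),
    }


def rank_score(s: SymbolSignals) -> float:
    f = features(s)
    return sum(WEIGHTS[name] * f[name] for name in WEIGHTS)


hub = SymbolSignals(days_since_commit=3, caller_count=40, changes_last_90d=2, coverage=0.8)
leaf = SymbolSignals(days_since_commit=3, caller_count=1, changes_last_90d=2, coverage=0.8)
print(rank_score(hub) > rank_score(leaf))
```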

The practical opening question is therefore not “what endpoints should this server expose?” It is “what questions should this server answer that text search cannot?” If the answer is “none”, do not build the server. That cut is the bridge to two later sections: freshness, which is what makes the index trustworthy, and the decision tree for whether the server should exist at all.

Freshness: the production layer most servers skip

A static index is the silent failure mode of every codebase MCP server. Internal codebases change hundreds of times a day. An agent that queries a symbol renamed three commits ago receives a confidently wrong answer, which is worse than a missing answer because the agent commits to it. The MCP spec 2025-11-25 defines no protocol-level caching primitives (modelcontextprotocol.io spec). Servers implement freshness independently, and most do not implement it at all.

Four practical patterns cover the design space. TTL expiration is the cheapest: pick a window, expire after, accept staleness inside the window. It is the right default for read-mostly libraries and the wrong default for active codebases. Event-based invalidation hooks a commit hook or file watcher and clears the cache when the underlying file changes. Version-based invalidation stores a content hash with each cached entry and invalidates when the hash drifts; this is cheap, correct, and the default I recommend. Notification-based invalidation uses the spec’s notifications/tools/list_changed to tell the client when tool capabilities change, but this is a tool-list signal, not an index-freshness signal. Do not confuse them.

The cost of staleness is concrete. Hallucinated function signatures lead to refactors that compile and break at runtime. False negatives on impact analysis lead to test runs that miss the regression. Calls to a deprecated symbol succeed in the editor and fail in CI a day later. Each failure looks like a model error and is actually an index error.

The practical default is file watcher plus content hash. Surface staleness explicitly to the agent in every tool response: include an as_of field with the commit SHA and timestamp the index was last refreshed against. The agent then decides whether the result is acceptable for its task; for read-only navigation, slightly stale is fine; for codemods or refactors, the agent can request a re-index. Most queries should not block on a re-index; serve a stale-flagged result and re-index in the background. Block the call only when the staleness is recent and the tool is high-stakes.
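A minimal sketch of the content-hash check with an as_of envelope. The class and field names are hypothetical, the file watcher and background re-index are left out, and the response shape is one possible layout rather than anything the MCP spec defines.

```python
import hashlib
from dataclasses import dataclass


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class IndexEntry:
    digest: str          # hash of the file content the symbols were extracted from
    symbols: list[str]


class SymbolIndex:
    def __init__(self, commit_sha: str, indexed_at: str) -> None:
        self.commit_sha = commit_sha
        self.indexed_at = indexed_at
        self._entries: dict[str, IndexEntry] = {}

    def put(self, path: str, data: bytes, symbols: list[str]) -> None:
        self._entries[path] = IndexEntry(content_hash(data), symbols)

    def lookup(self, path: str, current_data: bytes) -> dict:
        """Return cached symbols, flagged stale if the file content has drifted."""
        entry = self._entries.get(path)
        stale = entry is None or entry.digest != content_hash(current_data)
        return {
            "symbols": entry.symbols if entry else [],
            "stale": stale,
            "as_of": {"commit": self.commit_sha, "timestamp": self.indexed_at},
        }


index = SymbolIndex(commit_sha="abc1234", indexed_at="2026-05-01T12:00:00Z")
index.put("app/config.py", b"def parse_config(): ...", ["parse_config"])

fresh = index.lookup("app/config.py", b"def parse_config(): ...")
drifted = index.lookup("app/config.py", b"def load_config(): ...")
print(fresh["stale"], drifted["stale"])
```

The stale result is still served, with the flag and as_of attached, so the agent can decide whether to proceed or ask for a re-index.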

Auth and transport: OAuth 2.1, Streamable HTTP, per-tool scopes

The MCP spec 2025-11-25 made OAuth 2.1 with PKCE mandatory for HTTP transport (modelcontextprotocol.io authorization, 2025). Streamable HTTP replaced the deprecated HTTP+SSE in spec 2025-03-26. A February 2026 scan of 1,400 publicly addressable remote MCP servers found 93% on Streamable HTTP, 7% still on SSE (bloomberry, 2026). The Zuplo State of MCP Report still found 25% of servers shipping with no authentication at all and 38% of builders citing security as their top adoption blocker (Zuplo, 2025).

Pick the transport that matches the deployment shape. Stdio is the right default for local single-session servers running inside a single user’s process; it is cheap, fast, and the security boundary is the operating system. Streamable HTTP is required for shared multi-session servers (the standalone mode that any client can hit). The 2026 MCP roadmap targets stateless Streamable HTTP for horizontal scaling behind load balancers; today, most servers maintain per-session state, which prevents scale-out across replicas. Stateless design is non-trivial work; plan for it now if your server will outgrow a single host.

OAuth 2.1 is the spec mandate, not a suggestion. PKCE with S256 is required; implicit flow is prohibited; exact redirect URI matching is required; Client ID Metadata Documents are now preferred over Dynamic Client Registration. The SDK gap is real: as of May 2026, not every MCP SDK fully ships OAuth 2.1 plus PKCE, and client-side support varies by IDE. Calibrate the rollout to the part of the ecosystem you depend on; do not block on a piece of the spec your SDK has not landed yet.
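For reference, the S256 transform itself is small. This sketch follows RFC 7636 using only the standard library; the function names are illustrative, and a real client would normally get this from its OAuth library rather than hand-rolling it.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    # 32 random bytes encode to 43 base64url characters,
    # the minimum length RFC 7636 allows (43 to 128).
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def s256_challenge(verifier: str) -> str:
    # code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = s256_challenge(verifier)
print(len(verifier), len(challenge))
```

The client sends the challenge with the authorization request and the verifier with the token request; the authorization server recomputes the hash and rejects the exchange if the two do not match.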

The under-discussed pattern is per-tool-category scopes. Read tools require code:read; write tools require code:write; CI-trigger or codemod tools require code:execute. Mint per-call ephemeral tokens via RFC 8707 Resource Indicators so each token’s audience is bound to a specific tool category. The spec text supports this; the tutorial corpus does not. For a codebase server with a mixed read and write surface, per-tool scopes are not optional. They are the difference between “an attacker who phishes a read token” and “an attacker who phishes a write token.” This is the deterministic enforcement at the client side that MCP advisory access cannot give you.
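A minimal sketch of the server-side check, assuming the token has already been validated and its scopes extracted. The tool names apply_refactor and trigger_ci are hypothetical examples of write and execute tools, and ScopeError is an illustrative exception type.

```python
# Tool -> required scope. Unknown tools are denied by default.
TOOL_SCOPES = {
    "find_symbol": "code:read",
    "trace_callers": "code:read",
    "find_references": "code:read",
    "apply_refactor": "code:write",   # hypothetical write tool
    "trigger_ci": "code:execute",     # hypothetical execute tool
}


class ScopeError(PermissionError):
    pass


def authorize(tool: str, granted_scopes: set[str]) -> None:
    """Raise ScopeError unless the token's scopes cover the tool's category."""
    required = TOOL_SCOPES.get(tool)
    if required is None:
        raise ScopeError(f"unknown tool: {tool}")
    if required not in granted_scopes:
        raise ScopeError(f"{tool} requires scope {required}")


authorize("find_symbol", {"code:read"})  # passes silently

try:
    authorize("apply_refactor", {"code:read"})
except ScopeError as exc:
    print(exc)
```

Running the check before any tool handler executes means a phished read token cannot reach a write path, whatever the model was persuaded to request.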

mTLS deserves a mention. For internal-only servers with a fully controlled client population, mutual TLS plus a static service principal is simpler, cheaper, and equally secure. OAuth is the right answer for any externally reachable server; it is not always the right answer for a server that only ten engineers, on managed devices, ever touch.

Three production failure modes, and the design decisions behind them

OX Security disclosed an MCP STDIO architectural flaw on 15 April 2026; The Register reported up to 200,000 servers at risk of complete takeover with 150 million cumulative SDK downloads affected, and Anthropic declined protocol-level changes, calling the behaviour expected (The Register, 2026). Independently, an Apideck benchmark measured a 28% silent call failure rate on the GitHub MCP server under realistic agent workloads (Apideck, 2026). Invariant Labs demonstrated tool-description injection in Cursor with 72.8% attack success against o1-mini (Invariant Labs, 2025). Three failure classes; three design decisions behind them.

Token blowup is the most common production failure and the easiest to design out. The cause is tool count past the cliff plus missing pagination on list-style tools (list_all_symbols with no budget; find_references with no maximum). The design fix is a budget parameter on every retrieval tool, measured in output tokens, plus tool-count discipline upstream of the budget; the four design questions from the tool-shape section are what enforce it. Lazy tool loading via a search-tools tool (“which tools are relevant to this query?”) is the more aggressive variant; it keeps only the tools the agent will actually call in active context.
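A minimal sketch of budget enforcement by pagination. The four-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and the cursor-based response shape is hypothetical.

```python
from typing import Optional


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def paginate_by_budget(items: list[str], budget: int, cursor: int = 0) -> dict:
    """Return as many items as fit in `budget` tokens, plus a cursor for the rest."""
    if budget < 1:
        raise ValueError("budget must be at least 1 token")
    page: list[str] = []
    used = 0
    i = cursor
    while i < len(items):
        item = items[i]
        cost = estimate_tokens(item)
        if used + cost > budget:
            if not page:
                # A single oversized item: truncate it so the call still makes progress.
                page.append(item[: budget * 4])
                used = estimate_tokens(page[0])
                i += 1
            break
        page.append(item)
        used += cost
        i += 1
    next_cursor: Optional[int] = i if i < len(items) else None
    return {"items": page, "tokens_used": used, "next_cursor": next_cursor}


references = ["x" * 40 for _ in range(5)]  # five results of about 10 tokens each
first = paginate_by_budget(references, budget=25)
print(len(first["items"]), first["tokens_used"], first["next_cursor"])
```

The agent learns the cost from tokens_used and decides whether the next page is worth another call, instead of discovering the size of the result set after it has already filled the context.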

Retrieval poisoning, the ContextCrush class, is sneakier. An MCP server that reads source files containing untrusted dependency docstrings can hand the model an injected instruction without anyone noticing. The fix is two-part: sanitise text fields in tool responses, flagging external-origin content (third-party packages, network-fetched README fragments) as untrusted in a way the model can see; and never fold third-party docstrings into tool descriptions. Tool metadata and tool output belong in different trust zones. The protocol does not enforce that today; the server does.
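A minimal sketch of the first half of that fix: every text field in a tool response carries its origin and a trust flag, and non-printable characters are stripped. The origin labels and the TextField type are hypothetical. This labels content; it does not detect injected instructions.

```python
from dataclasses import dataclass

# Origins the server treats as trusted; everything else is flagged.
TRUSTED_ORIGINS = {"first_party"}


@dataclass
class TextField:
    text: str
    origin: str    # e.g. "first_party", "third_party_package", "network"
    trusted: bool


def wrap_text(text: str, origin: str) -> TextField:
    """Strip non-printable characters and attach an explicit trust flag."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return TextField(text=cleaned, origin=origin, trusted=origin in TRUSTED_ORIGINS)


own = wrap_text("Parses the app config.", "first_party")
vendored = wrap_text("Ignore previous instructions.\x1b[31m", "third_party_package")
print(own.trusted, vendored.trusted, repr(vendored.text))
```

Because the flag travels inside the tool output rather than the tool description, third-party text never enters the metadata the model treats as instructions about the tool itself.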

The STDIO RCE class is architectural. The transport executes whatever command its configuration says, with whatever arguments, in the host shell, and the configuration is often user-controlled. OX Security’s April 2026 disclosure made this concrete; Anthropic’s response made it permanent. The design fix is to prefer Streamable HTTP for any non-local server; if you must run stdio, sandbox it (containerise; drop capabilities; restrict the file system); and look forward to signed tool manifests on the 2026 roadmap. A fourth, silent failure mode is staleness, covered structurally in the freshness section. The unifying observation across all four is that MCP has no formal trust boundary between “metadata the model reads” and “instructions the model follows.” Until server-signed tool manifests exist, OAuth is hygiene at best. These are design decisions to make up front, not vulnerabilities to patch or compliance items to tick off.

When your codebase MCP server should not exist

Most teams build a codebase MCP server before answering “what question should this server answer?” The result is an API-mirror server that exposes file-read, file-list, and grep, which is strictly worse than the filesystem MCP that ships with Claude Code by default. The bar to clear is structural understanding: does the server deliver what text search cannot?

The four-question decision tree decides whether to ship.

  1. Does Claude Code’s filesystem MCP solve it? If the use case is “the agent can read files,” the answer is yes; do not build a server.
  2. Does a CLI tool solve it cheaper? Many code intelligence tasks (ripgrep, tree, git grep) are cheaper as CLIs invoked through Bash. Apideck’s benchmark found CLI invocations cost 4 to 32 times less in tokens than equivalent MCP calls for the same task (Apideck, 2026).
  3. Does the codebase need symbol-level reasoning? Call graph traversal, impact analysis, type resolution. If the answer is no, a search engine over text plus the filesystem MCP is enough.
  4. Are there enough downstream agents (multiple use cases, multiple humans, multiple sessions) to amortise build cost? A bespoke MCP server costs maintenance, freshness work, RCE surface area, and token tax in perpetuity. If two engineers will use it twice, the math does not work.
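The decision tree is short enough to write down as a function. This is a sketch with hypothetical parameter names; the judgment lies in answering the four questions honestly, not in the code.

```python
def should_build_server(
    filesystem_mcp_covers_it: bool,
    cli_is_cheaper: bool,
    needs_symbol_reasoning: bool,
    enough_downstream_agents: bool,
) -> tuple[bool, str]:
    """Walk the four questions in order and stop at the first disqualifier."""
    if filesystem_mcp_covers_it:
        return False, "use the filesystem MCP"
    if cli_is_cheaper:
        return False, "invoke a CLI through Bash"
    if not needs_symbol_reasoning:
        return False, "text search plus the filesystem MCP is enough"
    if not enough_downstream_agents:
        return False, "too few consumers to amortise the build"
    return True, "build it, read-only first"


print(should_build_server(False, False, True, True))
print(should_build_server(False, True, True, True))
```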

The cost of a low-value MCP server is four taxes for the value text search would have delivered for free: token tax, maintenance tax, RCE surface tax, freshness debt. The right minimal server is often four to eight tools, not thirty or eighty. The minimal codebase MCP is a search engine over the symbol graph plus a callers-tracer; v1 read-only; write tools earn their place through demonstrated value, not through being requested. This is the discipline of allocating across the four context surfaces, pulled from the agent layer down to the tool layer.

A reference design

Putting the framework together. A reference codebase MCP server has six tools, runs Streamable HTTP behind OAuth 2.1, indexes symbols with tree-sitter and content-hash freshness, and instruments four failure modes by class. v1 is read-only; write tools earn their place by demonstrating value.

The architecture in inventory form:

  • Six tools. find_symbol(name, kind, scope), trace_callers(symbol, depth, budget), find_references(symbol, kind=read|write), impact(change_set), read_symbol(id), and search_code(query, budget) as the text fallback.
  • A budget on every tool. Each tool accepts a budget parameter, measured in output tokens. The server enforces it by truncation, pagination, or ranking; the agent never has to discover the call cost the hard way.
  • An as_of on every response. Commit SHA plus timestamp the index was last refreshed against. The agent decides whether stale-flagged data is acceptable for its task; the server does not assume.
  • Streamable HTTP behind OAuth 2.1. Read tools require code:read. No write tools in v1. Tokens are scoped to the tool category via Resource Indicators.
  • Tree-sitter index plus content-hash freshness. A persistent knowledge graph keyed on symbol identity. File-watcher invalidation; content-hash version check; stale-flagged results when re-index is in flight; background re-index for non-blocking calls.
  • Observability from the first commit. Structured logs per tool call (tool, latency, tokens, result count, hit-or-miss). OpenTelemetry metrics. Failure-mode classification: timeout, blowup, poison, stale.
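A minimal sketch of the per-call structured log with failure-mode classification, using only the standard library. The field names are hypothetical, and the OpenTelemetry metrics are left out to keep the example self-contained.

```python
import json
from typing import Optional

# The four failure classes named in the inventory above.
FAILURE_CLASSES = {"timeout", "blowup", "poison", "stale"}


def log_tool_call(
    tool: str,
    latency_ms: float,
    tokens: int,
    result_count: int,
    failure: Optional[str] = None,
) -> str:
    """Emit one JSON log line per tool call and return it."""
    if failure is not None and failure not in FAILURE_CLASSES:
        raise ValueError(f"unknown failure class: {failure}")
    record = {
        "tool": tool,
        "latency_ms": latency_ms,
        "tokens": tokens,
        "result_count": result_count,
        "hit": result_count > 0,
        "failure": failure,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


ok_line = log_tool_call("find_symbol", latency_ms=12.5, tokens=180, result_count=3)
stale_line = log_tool_call("impact", latency_ms=40.0, tokens=0, result_count=0, failure="stale")
```

Rejecting unknown failure classes keeps the taxonomy closed, so a dashboard grouped by failure stays readable as the server grows.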

Explicit non-features matter as much as the features. There is no “do anything” tool. There is no shell-exec surface. Untrusted content (third-party docstrings, network-fetched fragments) is never folded into tool descriptions. No tool ships without a declared budget. The v1 to v2 split is read tools first; write tools (refactor application, file generation, CI trigger) earn their place by clearing the four-question filter, not by being asked for.

The reference here is not a recipe to copy. It is the shape a working codebase server takes when each design decision is made deliberately rather than defaulted into. The site’s existing post on the matured 32-tool reference documents one such server in production, with a 1,000-round benchmark behind its design. Both are tool-shape designs; one is minimal, one is matured. The number of tools is downstream of the discipline, not upstream.

FAQ

Should an MCP server mirror my API? No. Mirroring an API one endpoint per tool produces token blowup that scales linearly with surface area. Cloudflare’s 2,500-endpoint API would have consumed 1.17 million tokens; their two-tool Code Mode redesign collapsed it to roughly 1,000 (99.9% reduction). Microsoft’s Learn MCP uses three tools. GitHub cut its Copilot toolset from 40 to 13 with measurable benchmark gains. Design tools at the shape the agent needs (Cloudflare, 2026; Microsoft Engineering, 2026).

What transport should a codebase MCP server use? Stdio for local-only single-session use; Streamable HTTP for shared multi-session servers. HTTP+SSE was deprecated in spec 2025-03-26, and a February 2026 scan of 1,400 remote servers found 93% on Streamable HTTP, 7% still on SSE (bloomberry, 2026). The 2026 MCP roadmap targets stateless Streamable HTTP for horizontal scaling.

Do I really need OAuth 2.1? For any HTTP-transport server, yes. Spec 2025-11-25 made OAuth 2.1 plus PKCE mandatory (modelcontextprotocol.io authorization, 2025). The Zuplo State of MCP Report found 25% of servers shipping with no auth and 38% of builders citing security as their top adoption blocker (Zuplo, 2025). Internal-only servers with a fully controlled client population can defensibly use mTLS instead.

How do I keep my codebase index fresh? File watcher plus content-hash invalidation is the practical default. Tools should return an as_of (commit SHA plus timestamp) on every response so the agent can detect stale results. The spec’s notifications/tools/list_changed covers tool capability changes, not index freshness; that responsibility lives in the server, not the protocol (modelcontextprotocol.io spec, 2025).

What now?

Three takeaways for the design conversation.

  • Tool-shape, not API-mirror. The Cloudflare 1.17M to ~1K reduction is mechanical, not aspirational. Each tool answers a question text search cannot. Each response declares its budget and freshness.
  • A codebase server is a search engine. Symbol indexing, hybrid retrieval, freshness invalidation. API-gateway vocabulary is the wrong starting point. The right ranking signals are recency, centrality, blame heat, and testedness.
  • Three failure modes you must instrument by design. Token blowup. Retrieval poisoning. STDIO RCE. Plus staleness, which is silent until it isn’t.

Before the next sprint, do the four-question pass on the codebase MCP server you have, or the one you are about to start. Does Claude Code’s filesystem MCP cover it? Is a CLI cheaper? Do you need symbol-level reasoning? Will enough downstream agents amortise the build? If you cannot answer all four cleanly, your design is not done. The schema arithmetic is not patient. Neither is the security model.
