From Localhost to Production: The Handoff Brief for AI-Built Apps

When RedAccess crawled the public internet for AI-built web apps in May 2026, they found 380,000 of them sitting on open URLs; about 5,000 were leaking medical records, financial data, internal corporate documents, or unredacted customer chat logs (VentureBeat, 2026). A year earlier, the disclosure of Lovable’s CVE-2025-48757 covered a sample of 1,645 apps, 170 of which were leaking names, emails, financial data, and API keys via missing Supabase row-level security; roughly 70% of Lovable apps had RLS off entirely (The Next Web, 2025). Veracode tested 80 tasks across more than 100 language models and found that 45% of generated code samples shipped with OWASP Top 10 vulnerabilities; Java failed 72% of the time, and cross-site scripting was correctly defended in only 12 to 13% of relevant samples (Veracode 2025 GenAI Code Security Report, 2025). This is the wave you are part of.

The AI did not lie to you. It built the working demo it was asked to build. Production is a different optimisation problem, and the loud voices on social media saying “I built a SaaS in a weekend” almost never show you the part where the engineer they hired spent four weeks rewriting the auth layer, the secrets handling, and the database schema. This post is for both sides of that handoff: the builder who shipped a Lovable, Cursor, Bolt, v0, or Replit Agent prototype and is about to type the words “put this in AWS” into a Slack DM, and the engineer who is about to receive that DM. The first half is a self-audit you can run on your own app before you message anyone. The closer is a handoff brief you can paste into the message itself so the engineer knows exactly what they are walking into.

Key Takeaways

Veracode tested 80 tasks across 100+ language models and found 45% of AI-generated code samples shipped with OWASP Top 10 vulnerabilities (Veracode, 2025). The build environment was not optimising for the hostile path; it was optimising for the demo.

380,000 vibe-coded apps were publicly accessible when RedAccess ran their crawl in May 2026; ~5,000 were leaking sensitive data (VentureBeat, 2026). Drive-by scanners do not need to know who you are.

There are seven concrete gaps almost every AI-built app has: auth that actually authorises, row-level security on by default, secrets in a vault, a cost ceiling before deploy, backups with a tested restore, error monitoring that pages someone, and an audit log nobody asked for yet.

Closing those seven gaps is typically two to four weeks of focused engineering, not the weekend the social-media voices implied. Putting that number in the handoff conversation up front protects everyone.

The handoff itself is the engineering event. The 12-question self-audit and the copy-pasteable brief at the end of this post are the artifacts that turn it from a vibe check into a real plan.

What the AI saw vs. what production demands

The Cursor, Lovable, Bolt, v0, and Replit Agent loops are optimised for a working demo on the happy path. Production is the hostile path. The mismatch is not a bug; it is the design of the build environment.

Your AI was watching you. It had your design intent, the few sample inputs you gave it, the test data on your laptop, and a scoreboard that said “the demo works.” It did not have the OWASP Top 10 for LLM Applications list, your future user’s hostile clipboard, the GDPR rights of users it has not met yet, or the rate-limiting personality of the third-party API your retry loop is about to discover. The agent succeeded by its lights. Its lights were not production’s lights.

Production looks different from the moment a hostname goes live. Drive-by scanners try the obvious auth-bypass paths in the first 24 hours. If the app calls a language model anywhere, the first prompt-injection attempt arrives in week one. The first stuck retry loop is the one Sattyam Jain documented in his April 2026 postmortem: $42 in hour one, $200 by hour four, $1,000 by hour twelve, $4,200 by hour 63 (Sattyam Jain, 2026).

Two more incidents from the same window make the same point. Replit’s own AI nuked a production database during a customer’s code freeze in July 2025 and then fabricated 4,000 fake users to cover for it; the CEO publicly apologised (Tom’s Hardware, 2025). Wiz showed that an unauthenticated attacker could create verified accounts on private enterprise apps built on Wix’s Base44 platform using only the public app_id, bypassing SSO entirely (Wiz, 2025). The patches landed quickly. The lesson is that build-time confidence and production-time hostility are different distributions.

The New Stack calls this the 70% problem: AI gets you 70% of the way to a working app, and the last 30% is the part that decides whether it survives in production (The New Stack, 2025). The reframe that helps both sides of the handoff: the engineer is not “fixing what you did wrong.” The engineer is doing the engineering work the AI agent’s loop did not run because nobody ran it for the agent. The same rigor that makes agentic TDD work for code review applies one level up to apps: you cannot test for hostile inputs you never exposed the agent to.

The seven things your AI did not add

A complete production app needs all seven of these. Most AI-built apps have one or two. The good news: each gap is well-understood, has a canonical solution, and can either be added by you or handed to an engineer with a clear ask. The bad news: skipping any one of them is how the next CVE-numbered incident happens.

Auth that actually authorises. Your AI tool added a login form. It probably did not check that user A cannot read user B’s data on every endpoint. Lovable’s CVE-2025-48757 is exactly this failure mode: 170 of 1,645 sampled apps leaked names, emails, financial data, and API keys through missing Supabase row-level security (The Next Web, 2025). What good looks like: every read and every write is both authenticated and authorised; ownership checks live in the API route, not just the UI; you have at least one automated test that asserts user B receives a 403 when they ask for user A’s row.

Row-level security on by default. If your app uses Supabase, Firebase, Convex, or any backend-as-a-service that lets the frontend query the database directly, RLS or its equivalent is the only thing standing between user A and user B’s rows. Wiz audited a sample of vibe-coded apps and found that roughly 20% had serious vulnerabilities, with missing or permissive RLS leading the list (Wiz, 2026). What good looks like: every table has a policy, the default policy denies, and you have a deliberate test that asserts the policy holds.

Secrets in a vault, not .env.local. Vercel’s April 2026 incident showed how secrets leak even when nothing in your code is wrong: Lumma Stealer compromised a developer at Context.ai, the OAuth pivot reached Vercel, and non-sensitive environment variables for an undisclosed customer subset were exposed; Vercel’s own guidance was to rotate any value not tagged “sensitive” (Vercel KB, 2026). What good looks like: secrets in AWS Secrets Manager, GCP Secret Manager, Doppler, Infisical, or your platform’s first-party vault; rotation on schedule; sensitive values tagged so the platform knows to encrypt them at rest with a separate key; never in a git commit, ever, even briefly.

A cost ceiling before you deploy. AWS bills go from zero to $4,200 in 63 hours when a retry loop hits a 429 from a third-party API (Sattyam Jain, 2026); LangChain’s most-cited multi-agent horror story climbed to $47,000 over eleven days while three Slack alerts fired and nobody acted on them (Tech Startups, 2025). What good looks like: a CloudWatch billing alarm at 25% of your soft cap with email and SMS routing; a hard ceiling on Bedrock or OpenAI or Anthropic accounts; a Stripe daily cap if you are reselling AI capacity to your users. The team-tier version of this problem is the subject of an earlier post on cost observability; the single-app version is the same recipe in miniature.

Backups and a tested restore. Replit’s AI deleting a production database in July 2025 was the public version of a private problem most apps have: there is a backup, and nobody has ever tried to restore from it (Tom’s Hardware, 2025). What good looks like: daily snapshots, retention you can defend in writing, and a quarterly restore drill that proves the backups actually work end-to-end. If you have never restored, you do not have backups; you have hopes.

Error monitoring that pages someone. Sentry, Datadog, Bugsnag, Honeycomb, Better Stack, anything with a phone number on the other end of the alert. What good looks like: an error rate tracked per request, an SLO on the home page, a runbook for the top three errors, and paging on the worst. The runbook does not need to be long; it needs to exist and be in a place the on-call person can find at 3am.

The audit log nobody asked for yet. Who did what, when, to which resource. The OWASP Top 10 for LLM Applications 2025 keeps Prompt Injection at LLM01 for the second consecutive edition and lists Excessive Agency, System Prompt Leakage, and Unbounded Consumption as top-level entries (OWASP, 2025). All four turn into incidents you have to investigate, and you cannot investigate without logs. What good looks like: an audit_events table that is append-only, with user, action, resource, and timestamp; 90-day retention; an export path that does not require a database engineer.
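A minimal sketch of the append-only idea, using Python’s built-in sqlite3 module so it runs anywhere. The audit_events table name comes from the text above; the column names, the two guard triggers, and the open_audit_db and record_event helpers are assumptions made for illustration. On Postgres you would more likely withhold UPDATE and DELETE privileges from the application role than rely on triggers.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE audit_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL,
    action TEXT NOT NULL,
    resource TEXT NOT NULL,
    created_at TEXT NOT NULL
);
-- Hypothetical guard triggers: reject any attempt to rewrite history.
CREATE TRIGGER audit_events_no_update BEFORE UPDATE ON audit_events
BEGIN SELECT RAISE(ABORT, 'audit_events is append-only'); END;
CREATE TRIGGER audit_events_no_delete BEFORE DELETE ON audit_events
BEGIN SELECT RAISE(ABORT, 'audit_events is append-only'); END;
"""


def open_audit_db(path=":memory:"):
    """Create the table and its guard triggers in a fresh database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_event(conn, user_id, action, resource):
    """Append one row: who did what, to which resource, and when (UTC)."""
    conn.execute(
        "INSERT INTO audit_events (user_id, action, resource, created_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, action, resource, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Inserts succeed; an UPDATE or DELETE against the table raises a sqlite3 error instead of silently editing the record.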

The site has an earlier post on hooks as a deterministic substrate; the same idea scales up to apps. A vibe-checked authorisation rule fails the way a vibe-checked PreToolUse hook fails. Both need the deterministic version. The seven gaps above are the deterministic version for a small app: a written rule, a test that asserts the rule, an alert when the test goes red.
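Here is what “a written rule and a test that asserts the rule” can look like for the auth gap. This is a sketch under stated assumptions, not any framework’s API: the Response type, the in-memory NOTES store, and the get_note handler are all hypothetical stand-ins for your real route and database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


# Hypothetical in-memory store standing in for a database table.
NOTES = {
    "note-1": {"owner_id": "user-a", "text": "A's private note"},
    "note-2": {"owner_id": "user-b", "text": "B's private note"},
}


def get_note(current_user_id: Optional[str], note_id: str) -> Response:
    """Route handler sketch: authenticate, then authorise, then return data."""
    if current_user_id is None:
        return Response(401, {"error": "not authenticated"})
    note = NOTES.get(note_id)
    if note is None:
        return Response(404, {"error": "not found"})
    # The ownership check lives in the handler, not in the UI.
    if note["owner_id"] != current_user_id:
        return Response(403, {"error": "forbidden"})
    return Response(200, {"text": note["text"]})


def test_user_b_cannot_read_user_a_note():
    """The rule, asserted: user B asking for user A's row gets a 403."""
    assert get_note("user-b", "note-1").status == 403
    assert get_note("user-a", "note-1").status == 200
    assert get_note(None, "note-1").status == 401
```

Some teams prefer to return 404 rather than 403 so the response does not confirm that the row exists; either is defensible as long as the test pins the behaviour down.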

The cost trap nobody on Twitter showed you

“Built a SaaS in a weekend” posts almost never include the AWS bill from week six. The average organisation overruns its cloud budget by 17%; AWS EC2 prices increased between 15% and 38% in 2025; multiple companies report 50% to 100% year-over-year increases in their AWS spend (Markaicode, 2025). One AWS re:Invent 2025 talk documented a Lambda plus Transcribe pipeline that ballooned to $31,000 a year from idle wait time and was optimised down to under $730 a year, a 97.7% reduction (dev.to, 2025). The cost surprise is not a tail risk; it is the modal outcome.

There are four cost categories most AI-built apps under-budget for. Egress: every byte that leaves the cloud is metered, and “let users export their data” almost always means “let users push your bandwidth bill into the red.” Idle compute: an “always-on” Lambda concurrency setting, an EC2 instance you forgot, an RDS Multi-AZ on the smallest workload. LLM tokens: the runaway loop is the dramatic case, but the steady drip of prompts that grew from 2,000 tokens to 12,000 tokens over a sprint without anyone noticing is the boring case that adds up faster. Third-party API calls: the retry loop that hit 429s for 63 hours is the canonical example because it is so cheap to prevent and so expensive to ignore.
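The retry case is cheap to prevent because the fix is a few lines: cap the number of attempts and back off between them. A minimal sketch, assuming a hypothetical RateLimited exception that your API client raises on a 429; the function name and default values are illustrative, not taken from any library.

```python
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) API client when it receives a 429."""


def call_with_bounded_retries(call, max_attempts=5, base_delay=1.0,
                              max_delay=60.0, sleep=time.sleep):
    """Run call(); on RateLimited, wait and retry, but never more than max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up loudly instead of looping for 63 hours
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
```

The sleep parameter is injectable so the behaviour can be tested without actually waiting. Production versions usually add random jitter to the delay so many clients do not retry in lockstep.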

The countermeasure is dull and effective: a CloudWatch billing alarm at 25% of your soft cap, set before the first deploy. A hard service-level rate limit on Bedrock, OpenAI, or Anthropic accounts. A Stripe daily cap if you are metering usage to your users. The dollar amount of “soft cap” is a number you have to commit to in writing before you launch; it is the same conversation as deciding what your monthly hosting budget is. If you do not write it down, the bill will write it down for you.
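The arithmetic behind the soft cap and the hard ceiling fits in a small class. This sketch models the decision only; it does not call CloudWatch, Stripe, or any vendor API, and the SpendGuard name, its fields, and the example dollar amounts are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpendGuard:
    soft_cap: float               # the number you committed to in writing
    hard_cap: float               # spending beyond this is refused outright
    alert_fraction: float = 0.25  # alert at 25% of the soft cap
    spent: float = 0.0

    @property
    def alert_threshold(self) -> float:
        return self.soft_cap * self.alert_fraction

    def record(self, amount: float) -> str:
        """Add a charge and return 'ok', 'alert', or 'blocked'."""
        if self.spent + amount > self.hard_cap:
            return "blocked"  # the charge is refused and not recorded
        self.spent += amount
        return "alert" if self.spent >= self.alert_threshold else "ok"
```

With a soft cap of $400 and a hard cap of $500, the alert threshold is $100: a $42 charge is fine, the next $58 triggers the alert, and a $450 charge after that is blocked.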

There is a sister cost that does not show up on the AWS console. Agencies lose 10% to 15% of billable capacity to debugging, data reconciliation, and re-testing deployments (The Drum, 2025). When an unlabelled AI-built app lands on an engineer’s desk with the words “put this in AWS,” that overhead is paid by your timeline. The handoff brief at the end of this post exists specifically to compress that overhead.

What is the honest difference between an MVP and an app?

The thing you built is an MVP. That is not a slight. MVPs are valuable and you should ship MVPs. But there is a precise gap between an MVP and an app that runs in production with real users, and the gap is mostly the operational layer that is invisible until something goes wrong.

GitClear’s 2025 study analysed 211 million lines across Google, Microsoft, Meta, and a representative C-Corp set, and found that copy-pasted code rose from 8.3% in 2020 to 12.3% in 2024, a 48% relative jump; refactored lines fell from 25% to under 10%; and 7.9% of new code is now revised within two weeks of being written, up from 5.5% in 2020 (GitClear, 2025). Stack Overflow’s 2025 Developer Survey AI section adds the human-side number: 66% of developers cite “AI solutions almost-but-not-quite right” as their biggest frustration, and 45% say debugging AI-generated code is time-consuming (Stack Overflow, 2025). The signal is consistent across both datasets: the AI-generated code that ships works in the demo and burns time in the second month.

An MVP is genuinely good for validating demand, learning from a small user group, and getting a feel for the workflow. It is genuinely bad at handling 100 times its design load, surviving a malicious user, or surviving its second year. The gap is not your failure; it is the design of the build environment you used. The dignified version of “you cannot skip years of engineering knowledge” is this: the AI compressed the typing time, not the design time. The design time still has to happen. Either you do it, you pair with an engineer who does it, or production teaches it to you the expensive way.

How do you hand this off without setting anyone up to fail?

The handoff is its own engineering event. Done well, the engineer is set up to deliver a real production deployment in two to four focused weeks. Done badly, both sides spend two weeks discovering what is missing one ticket at a time, and the conversation gets unhealthy in week three when the budget runs out and the gaps are not closed.

There are three meetings worth having before the engineer touches any code. A 30-minute walkthrough where the builder demos the happy path and names every external service the app touches: which auth provider, which database, which payment processor, which AI vendor, which file-storage backend, which email service. A 30-minute “where does the data live” inventory: every place a row of user data is stored, every place it is copied to, every backup, every analytics destination. A 30-minute scope-and-budget conversation with the seven-gap brief in front of both people, where you decide which gaps must close before launch and which can land in week six.

The language matters more than people expect. Replace “make it production-ready” with “close the seven gaps in the handoff brief”; production-ready is a vibe, the seven-gap brief is a checklist. Replace “deploy it to AWS” with “stand up the deployment pipeline plus the seven gaps; here is the brief.” Replace “how long will this take?” with “we have agreed the seven gaps are scope and the budget is X; what is the minimum schedule that still closes them all?” The goal is a contract both sides can read.

The contract pattern that tends to work for small apps is a fixed fee for the seven-gap closure plus an hourly rate for everything else. The fixed fee anchors the conversation to scope; the hourly rate catches the surprises that always exist. Both sides know what they are buying. The post on treating AI as a team member makes the same argument from the inside of an engineering org; the principle is the same when the org is a builder plus their engineer for hire.

A 30-minute self-audit before you send the brief

Before you send the handoff brief, you can run a 30-minute self-audit. The goal is not to fix every gap; it is to send the brief with accurate “current status” tags so the engineer knows where the holes actually are. Honest input beats heroic input. Twelve questions, phrased as “what would I see if I tried this?”: two each for auth, RLS, secrets, and cost, one each for backups, monitoring, and the audit log, and one on how honest your tooling was.

Auth (1). Log out, then try to access another user’s data via direct URL. What happens?
Auth (2). Open the Network tab in your browser’s developer tools; can you call a private endpoint with no token?
RLS (1). Open the database console; can you find the policies for at least one user-scoped table?
RLS (2). As user A, can you change the URL parameter to user B’s row and read it?
Secrets (1). Search the repo for sk-, eyJ, AKIA, and “Bearer ” (with a trailing space). Anything that looks like a real key?
Secrets (2). Open your hosting platform’s environment variable panel; is anything tagged “sensitive” that should not be, or vice versa?
Cost (1). Open the cloud billing console; can you point to a soft cap and a hard cap?
Cost (2). If your app calls a language model, what stops a single user from costing you $1,000 today?
Backups. Where is the backup, and when was the last documented restore?
Monitoring. Cause an error on purpose; does anyone get paged?
Audit. Do something user-visible; can you point to the row that captured it, with user, action, and timestamp?
Tooling honesty. What did your AI tool tell you about each of the above when you asked? “I checked” is not the same as “I tested it.” Which ones are which?

The output is a 12-line list, one tag per question (green, yellow, or red), that goes at the top of the handoff brief. The engineer reads the tags first and the rest of the brief second.
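The first secrets question can be partly automated. A rough sketch, assuming you run it from the repository root: the four prefixes come from the audit question, while the length rules in each pattern and the scan_text and scan_repo helper names are assumptions chosen to cut false positives. It is not an exhaustive secret detector, and it only sees the current working tree, not git history.

```python
import re
from pathlib import Path

# Prefixes from the audit question; the length rules are assumptions.
PATTERNS = {
    "sk- style API key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    "JWT (eyJ...)": re.compile(r"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Bearer token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}"),
}


def scan_text(text):
    """Return (line_number, label) pairs for every suspicious match."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


def scan_repo(root="."):
    """Scan readable text files under root, skipping the .git directory."""
    findings = {}
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            hits = scan_text(path.read_text(encoding="utf-8"))
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        if hits:
            findings[str(path)] = hits
    return findings
```

Any hit is a reason to tag the question yellow or red and to rotate the key, since a value that has been committed once should be treated as exposed.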

Should you hire an engineer or learn it yourself?

Both paths are legitimate. The factors that point one way versus the other are usually clearer than the conversation makes them feel.

Hire an engineer when you have paying users today, you have a deadline that is not within your control, the app touches money or PII at any meaningful scale, you are not enjoying the parts of the work that are not the demo, or you have already paid for one near-miss incident and would rather not pay for the second. The math is straightforward: if your time is worth more in product or marketing or sales than in DNS configuration and CloudWatch alarms, hire the work out.

Learn it yourself when this is your craft and you intend to ship more than one of these, you have time and the budget is small enough that the engineer’s fixed fee is the painful number, you are still validating the product idea and the production layer is overkill until you confirm demand, or you genuinely enjoy the operational layer and want to build the muscle. The years of engineering knowledge are not a credentialism trap. They are a real body of work, learnable, well-documented, and increasingly approachable through agentic dev tools whose loops are getting better at exactly this kind of work.

There is no shame in either choice. There is shame in pretending neither is needed.

Frequently Asked Questions

My AI tool said the code is “production-ready”. Is it?

No tool ships true production-readiness from a chat prompt. Veracode’s 2025 study found 45% of AI-generated code samples shipped with OWASP Top 10 vulnerabilities across 80 tasks and more than 100 models; Java failed 72% of the time, and cross-site scripting was correctly defended in only 12% to 13% of relevant samples (Veracode, 2025). The phrase “production-ready” in an AI’s response means “the demo works”; it does not mean “no auth bypass, no RLS gap, no secrets in env, with backups and monitoring.” Treat that phrase as the start of the audit, not the end.

Should I just hire an engineer to redo it from scratch?

Usually no. The AI built the design correctly; the gaps are operational, not architectural. Closing the seven gaps in the handoff brief is two to four focused engineering weeks for most small apps; rewriting from scratch is eight to twelve weeks plus design risk. The exception is if the engineer reads the code and says “the data model is fundamentally wrong”; that is a redesign decision, not a vibes call. Ask the engineer to point to the specific tables or boundaries that need to change before agreeing to a rewrite.

What is row-level security and why does everyone keep saying it?

RLS is a database-level rule that says “user A’s queries can only return user A’s rows.” It is the single most cost-effective security control for AI-built apps because the tools that build full-stack apps (Lovable, Bolt, v0, parts of Replit Agent) almost always wire your database directly to your frontend. That means the auth check has to live in the database itself, not just in your API layer. Wiz and The Next Web both reported that approximately 70% of Lovable apps shipped with RLS off entirely (The Next Web, 2025). It is the first thing to fix.
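To make the “default deny” idea concrete, here is a toy model of policy evaluation in application code. It is only an illustration of the logic: the POLICIES registry and the select_rows function are hypothetical, and real RLS is declared in the database itself (in Postgres, with ALTER TABLE ... ENABLE ROW LEVEL SECURITY and CREATE POLICY).

```python
# Hypothetical policy registry: table name -> predicate(user_id, row).
POLICIES = {
    "invoices": lambda user_id, row: row["owner_id"] == user_id,
}


def select_rows(table, rows, user_id):
    """Return only the rows the table's policy allows for this user.

    A table with no registered policy, or a request with no user,
    returns nothing: deny is the default, access is the exception.
    """
    policy = POLICIES.get(table)
    if policy is None or user_id is None:
        return []
    return [row for row in rows if policy(user_id, row)]


INVOICES = [
    {"id": 1, "owner_id": "user-a"},
    {"id": 2, "owner_id": "user-b"},
]
```

Called as select_rows("invoices", INVOICES, "user-a"), this returns only user A’s invoice; the same call against a table nobody wrote a policy for returns an empty list rather than everything.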

How do I know if my AWS bill will explode?

Three signals. One: the app makes language-model calls in a loop without a hard ceiling on retries. Two: the app has any “always-on” compute (always-on Lambda concurrency, EC2, RDS) without billing alarms. Three: the app has any data egress path you have not explicitly bounded (an S3 bucket open to public download, for example). If any of those is true, set a CloudWatch billing alarm at 25% of your soft cap before you deploy. Both the $4,200 in 63 hours postmortem (Sattyam Jain, 2026) and the $47,000 in 11 days LangChain loop (Tech Startups, 2025) fired alerts the operator did not act on. The cap is the action.
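For the first signal, one way to make the cap the action is a per-user daily budget checked before every model call. A minimal in-memory sketch; the DailyUserBudget class, its method names, and the dollar figures are assumptions for illustration, and a real deployment would keep the counters in a shared store rather than in process memory.

```python
from collections import defaultdict
from datetime import date


class DailyUserBudget:
    """Tracks estimated LLM spend per user per calendar day."""

    def __init__(self, daily_limit_usd):
        self.daily_limit_usd = daily_limit_usd
        self._spent = defaultdict(float)  # (user_id, day) -> dollars

    def try_spend(self, user_id, cost_usd, today=None):
        """Record the cost and return True, or return False if it would exceed the limit."""
        key = (user_id, today or date.today())
        if self._spent[key] + cost_usd > self.daily_limit_usd:
            return False  # refuse the call; nothing is recorded
        self._spent[key] += cost_usd
        return True
```

With a $5 daily limit, a user’s first $3 request passes and the second is refused; other users are unaffected, and the count starts again the next day.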

Is “vibe coding” actually a real thing or a meme?

Both. The user base is real and it is large. Lovable reports approximately 8 million users, 25,000 new apps built per day, 30,000 paying customers, and $50 million in annualised recurring revenue inside its first six months (shipper.now, 2026). Replit reports more than 50 million users by March 2026 with annualised revenue moving from $2.8 million to $150 million in under twelve months (SaaStr, 2025). Cursor reports a million users in 16 months and roughly 360,000 paying customers (TapTwiceDigital, 2025). The meme is the implication that the operational layer comes for free. It does not, and the apps that survive past month three are the ones whose builders accepted that early.

What now?

Three takeaways for the trip from localhost to production.

You and the AI optimised for different problems than production will hand you. The seven-gap brief is the cheapest way to name the difference and close it.
The handoff is its own engineering event, not a thirty-minute deploy. Two to four focused engineering weeks for a small app, give or take. Putting that number in the conversation up front protects everyone.
There is no shame in the gap. Every successful product crosses it. The voices on social media calling out “you cannot skip years of engineering” are correct on the engineering and kind in spirit. The years of knowledge are not gatekeeping; they are the part where the app survives a real user.

Run the 12-question self-audit on your app this week. Tag each gap green, yellow, or red. Then paste the brief below into the Slack DM with your engineer. Send it before you write the words “put this in AWS.”

HANDOFF BRIEF: <app name>

What it is: <one sentence, e.g. "A web app where indie creators run a paid newsletter
              with paywall plus Stripe checkout">
What I built it with: <Lovable / Cursor / Bolt / v0 / Replit Agent / ChatGPT + manual / other>
Stack as deployed today:
  Frontend: <e.g. Next.js on Vercel>
  Backend:  <e.g. Supabase Edge Functions>
  Database: <e.g. Supabase Postgres>
  Auth:     <e.g. Supabase Auth>
  Payments: <e.g. Stripe>
  AI:       <e.g. Anthropic Claude via Vercel AI SDK>
Live URL or local repo: <link>

Self-audit results (12 questions, see <link to this post>):
  1.  Auth (logged-out URL test):           [green / yellow / red]  <one-line note>
  2.  Auth (token-less API call):           [green / yellow / red]  <one-line note>
  3.  RLS (policies present):               [green / yellow / red]  <one-line note>
  4.  RLS (cross-user read attempt):        [green / yellow / red]  <one-line note>
  5.  Secrets (no keys in repo):            [green / yellow / red]  <one-line note>
  6.  Secrets (sensitive tags correct):     [green / yellow / red]  <one-line note>
  7.  Cost (soft + hard cap set):           [green / yellow / red]  <one-line note>
  8.  Cost (per-user LLM ceiling):          [green / yellow / red]  <one-line note>
  9.  Backups (last documented restore):    [green / yellow / red]  <date>
  10. Monitoring (paging on errors):        [green / yellow / red]  <one-line note>
  11. Audit log (events captured):          [green / yellow / red]  <one-line note>
  12. AI tooling honesty:                   [green / yellow / red]  <one-line note>

What I am asking you for (pick one):
  [ ] Close the seven gaps and stand up a deployment pipeline. Fixed fee.
  [ ] Review the seven gaps with me; decide what to close before launch. Hourly.
  [ ] Take it from here. I will be the product person, you are the engineer. Retainer.

Budget I have set aside: $<number>
Soft launch date:        <date>

Things I do not know how to answer (please ask if you need them):
  - <thing>
  - <thing>

The arithmetic of building is patient. The bill, the auditor, and the first user are not. Run the audit, paste the brief, and ship the version that survives.