State of AI dev — June 2026: the month the demo stopped counting · Metaheuristic

There’s a sentence that showed up, almost verbatim, in three different client kickoffs this month: “It worked in the demo.” It’s always said as a defense, and it always means the same thing: the feature was built, somebody recorded a Loom, and then it met real users, real permissions, and real data volume, and it fell over. June was the month the wider industry stopped pretending the demo was the hard part. A widely shared piece on the forward-deployed engineer put it plainly: a RAG chatbot over 10 documents “works like magic” in the honeymoon phase, but move it to thousands and “efficiency, accuracy and security collapse under the weight of unmanaged complexity” (iNews24). Preventing that collapse is the entire job now. This is what we saw.

TL;DR — the month in one paragraph

Jun 16: The consensus quietly flipped from “can AI write this?” to “can this survive deployment?” Packt’s roundup led with the blunt version: why most AI systems fail after deployment.
Google I/O 26: Google bet the keynote on the “Agentic Enterprise” - Antigravity for agent orchestration, Gemini 3.5 Flash for coding - pushing agents from chat into autonomous workflows (Google Cloud).
MCP crossed 9,400+ public servers and became the default integration layer for Claude, Cursor, Windsurf, Copilot, and Replit - “USB-C for AI integrations,” now operational, mostly without auth (Provide.ai).
The supply chain got loud: AI agents are “installing packages no one owns,” Socket raised a $60M Series C at a $1B valuation, and Endor Labs shipped AURI to scan inside Cursor and Claude Code (The New Stack).
A directional stat made the rounds: one vendor study claimed 62% of AI-generated code is vulnerable - treat the number as marketing, the failure mode (missing row-level security, auth bypass that passes every functional test) as real.
The toolchain consolidated: survey data showed early-stage teams standardizing on Claude Code over Cursor and Copilot, while GitHub started charging separately for Copilot code review.

Agents crossed the line from demo to production - badly

Every major platform spent June pushing agents from “answer a question” to “execute a workflow.” Google’s I/O framing - autonomous agents acting “across apps, data systems, and developer environments” - is now the default ambition, not the moonshot. The problem is that the gap between a scripted demo and an autonomous workflow is exactly where the failures live. Packt’s lead story this month wasn’t “how to build an agent,” it was why most AI systems fail after deployment - because that’s the part nobody rehearses.

What we saw in client code: agents with no idempotency on side-effecting tools (the same “send invoice” call fires twice on a retry), no budget ceiling on token spend (one looping agent burned a four-figure bill overnight), and no human checkpoint on irreversible actions. The demo never exercised any of those paths because the demo ran once, happily, with a friendly input.
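The double-firing “send invoice” case is the easiest of these to show in code. What follows is a minimal sketch, not a production pattern: the IdempotentTool wrapper, the in-memory result store, and the send_invoice stand-in are all hypothetical names invented for illustration, and a real system would need a durable store shared across workers.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


class IdempotentTool:
    """Wraps a side-effecting tool so a retried call with the same arguments runs once."""

    def __init__(self, name: str, fn: Callable[..., Any]) -> None:
        self.name = name
        self.fn = fn
        # Hypothetical in-memory store; assume a durable store in real deployments.
        self._results: Dict[str, Any] = {}

    def _key(self, kwargs: Dict[str, Any]) -> str:
        # Derive a stable key from the tool name and its (JSON-serializable) arguments.
        payload = json.dumps({"tool": self.name, "args": kwargs}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, **kwargs: Any) -> Any:
        key = self._key(kwargs)
        if key in self._results:
            # A retry: return the recorded result instead of firing the side effect again.
            return self._results[key]
        result = self.fn(**kwargs)
        self._results[key] = result
        return result


sent: List[Tuple[str, int]] = []  # stands in for the real side effect


def send_invoice(customer_id: str, amount_cents: int) -> str:
    sent.append((customer_id, amount_cents))
    return f"invoice-{len(sent)}"


tool = IdempotentTool("send_invoice", send_invoice)
first = tool.call(customer_id="c-42", amount_cents=1999)
retry = tool.call(customer_id="c-42", amount_cents=1999)  # simulated retry
```

Keying on the arguments is a simplification: two legitimately identical invoices would be collapsed into one. A key supplied by the caller per logical operation is usually the safer design.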

What to do: before an agent ships, write down its irreversible actions and put a human-in-the-loop gate on each one. Add a hard step-count and token-budget ceiling that fails closed. Treat every tool call as if it will be retried, because under load it will be. This is the bulk of what an AgentOps retainer actually does in month one.
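As a rough sketch of the ceiling and the gate described above, assuming the agent loop reports token usage after each step: AgentBudget, guard_action, the IRREVERSIBLE set, and the tool names in it are hypothetical, and the approver callback stands in for whatever human review channel you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Set


class BudgetExceeded(RuntimeError):
    """Raised when the agent goes past its step or token ceiling."""


class ApprovalRequired(RuntimeError):
    """Raised when an irreversible action has not been approved by a human."""


@dataclass
class AgentBudget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> None:
        # Count first, then check, so any overshoot stops the run.
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps or self.tokens > self.max_tokens:
            raise BudgetExceeded(f"steps={self.steps}, tokens={self.tokens}")


# Hypothetical tool names; the real list comes from writing down your agent's actions.
IRREVERSIBLE: Set[str] = {"send_invoice", "delete_records"}


def guard_action(tool_name: str, approve: Callable[[str], bool]) -> None:
    """Fail closed: an irreversible tool proceeds only if the approver says yes."""
    if tool_name in IRREVERSIBLE and not approve(tool_name):
        raise ApprovalRequired(tool_name)
```

The agent loop would call budget.charge(...) after every model call and guard_action(...) before every tool call. Both raise rather than return a flag, so a forgotten check on the caller's side cannot silently let the run continue.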

Retrieval is a permissions problem wearing a search costume

The single most common defect we found this month wasn’t bad embeddings or weak reranking - it was a RAG system happily returning documents the asking user was never allowed to see. The OX study that got passed around (62% of AI-generated code is vulnerable) is marketing-shaped and we’d treat the headline number as directional only, but the mechanism it describes is one we keep fixing in person: row-level security “was never configured on the database,” and the app “can pass every functional test, deploy successfully, and serve thousands of users before anyone notices the authentication bypass sitting in the login flow.”

That’s retrieval’s version of the same bug. The vector search works. The latency is great. And it leaks, because permission was treated as a UI concern instead of a retrieval-time filter. The forward-deployed-engineer piece names the same wall: the prototype over 10 docs is trivial; the production asset over thousands stands on “scalability, rigorous security and sustained performance” or it doesn’t stand at all.

What to do: filter by access control before the nearest-neighbor search, not after - the wrong document should never enter the candidate set. Test it adversarially: log in as a low-privilege user and try to retrieve a high-privilege document by paraphrasing it. If you can, your RAG is an exfiltration tool. Permission-aware RAG is most of our build work right now for exactly this reason.
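Here is a toy illustration of “filter before search,” under loud assumptions: the Doc type, the group-based access model, and the brute-force cosine ranking are all invented for this sketch. A real vector store would apply the same filter through its own metadata or pre-filter mechanism.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Doc:
    doc_id: str
    embedding: Sequence[float]
    allowed_groups: FrozenSet[str]  # hypothetical access model: group membership


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: Sequence[float], user_groups: FrozenSet[str],
           docs: List[Doc], k: int = 3) -> List[str]:
    # Step 1: drop everything the user cannot read, so it never enters the candidate set.
    candidates = [d for d in docs if d.allowed_groups & user_groups]
    # Step 2: rank only the permitted candidates by similarity.
    ranked = sorted(candidates, key=lambda d: cosine(query, d.embedding), reverse=True)
    return [d.doc_id for d in ranked[:k]]


docs = [
    Doc("handbook", (1.0, 0.0), frozenset({"staff", "exec"})),
    Doc("salaries", (0.9, 0.1), frozenset({"exec"})),
    Doc("lunch-menu", (0.0, 1.0), frozenset({"staff"})),
]
query = (0.9, 0.1)  # a paraphrase that lands right on the restricted document

staff_hits = search(query, frozenset({"staff"}), docs)
exec_hits = search(query, frozenset({"exec"}), docs)
```

The query vector is deliberately the closest possible match for the restricted document, which is the adversarial paraphrase test in miniature: the staff user still never sees it, because ranking only runs over documents that survived the permission filter.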

Guardrails became the perimeter, and the supply chain is the new soft underbelly

June’s sharpest security story had nothing to do with prompt injection and everything to do with provenance. As The New Stack put it, AI coding agents “increasingly pull packages, add dependencies, and install tools autonomously” and “there is no accountability” - security teams are flying blind on code their own agents introduced. The money agreed with the thesis: Socket closed a $60M Series C at a $1B valuation on real-time malicious-package blocking (it flagged a malicious dependency in Axios within six minutes), and Endor Labs shipped AURI as an MCP server and CLI to detect vulnerabilities inside Cursor and Claude Code.

Meanwhile MCP itself quietly became load-bearing infrastructure - 9,400+ servers tracked across registries, native support in every major tool, crossing “from interesting spec to operational standard in roughly 18 months” (Provide.ai). Most of those servers ship with no auth story. An agent with tool access and a connected MCP server is a privileged binary that writes its own dependencies, and almost nobody is reviewing what it pulls.

What to do: pin and review every dependency an agent is allowed to add, and put a real allowlist on which MCP servers it can reach. Token-validate MCP connections - OAuth-style, not “it’s localhost so it’s fine.” Run an LLM security review against the OWASP LLM Top 10 before you give an agent write access to anything that matters. This is the work the new round of funding is telling you to take seriously.
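The allowlist and pinning checks can be sketched in a few lines. Everything named here is an assumption for illustration: the server host, the package name and version, and the policy of accepting only exact “name==version” pins. The sketch covers only the allowlist half of the advice; token validation on the connection itself is out of its scope.

```python
from typing import Dict, Set, Tuple
from urllib.parse import urlsplit

# Hypothetical allowlist, keyed by (scheme, host, port).
ALLOWED_MCP_SERVERS: Set[Tuple[str, str, int]] = {
    ("https", "mcp.internal.example", 443),
}

# Hypothetical approved pins: package name -> exact approved version.
APPROVED_PINS: Dict[str, str] = {"example-lib": "1.4.2"}


def mcp_server_allowed(url: str) -> bool:
    """True only if the URL's scheme, host, and port match an allowlisted server."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return (parts.scheme, parts.hostname, port) in ALLOWED_MCP_SERVERS


def dependency_approved(requirement: str) -> bool:
    """Accept only exact pins ("name==version") that match the approved list."""
    name, sep, version = requirement.strip().partition("==")
    if not sep:
        return False  # unpinned or a range specifier: reject
    return APPROVED_PINS.get(name.strip().lower()) == version.strip()
```

Both functions default to “no”: an unknown server, a localhost URL, a version range, or a package nobody approved all come back False, which is the behavior you want from a check an agent cannot argue with.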

Evals quietly became the unit of progress

The toolchain consolidated this month - survey data showed early-stage teams migrating to Claude Code for autonomous multi-file execution and lower hallucination rates, while GitHub started billing Copilot code review as a separate line item. But the more important shift is cultural: the teams shipping fastest stopped arguing about which model is smartest and started measuring whether their system got better between Tuesday and Thursday. With a new frontier model landing every few weeks, “it feels better” is not a migration strategy. An eval suite is.

We wrote about this dynamic in momentum-driven code review - when the model writes the diff in seconds, the bottleneck moves to verification. The same is true one layer up: when you can swap GPT-5.4 for Gemini 3.5 Flash in an afternoon, the only thing standing between you and a silent quality regression is a graded eval set that runs in CI and blocks the merge.

What to do: build a golden set of 50-200 real inputs with known-good outputs, grade every prompt and model change against it, and gate deploys on it. Without it, every model upgrade is a coin flip you can’t see the result of. Evals and quality gates are the cheapest insurance in this entire stack.
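A minimal sketch of such a gate, assuming the system under test can be called as a function from prompt to output and that exact-match grading is good enough for the cases in the golden set. GoldenCase, pass_rate, gate, and the 0.95 default threshold are hypothetical; most real suites need fuzzier graders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str


def pass_rate(system: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases whose output matches exactly (after trimming whitespace)."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for c in cases if system(c.prompt).strip() == c.expected.strip())
    return passed / len(cases)


def gate(system: Callable[[str], str], cases: List[GoldenCase],
         threshold: float = 0.95) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    return 0 if pass_rate(system, cases) >= threshold else 1


golden = [
    GoldenCase("2+2", "4"),
    GoldenCase("capital of France", "Paris"),
    GoldenCase("reverse 'abc'", "cba"),
    GoldenCase("3*3", "9"),
]


def candidate_system(prompt: str) -> str:
    # Stand-in for a model call; it gets one of the four cases wrong on purpose.
    answers = {"2+2": "4", "capital of France": "Paris", "reverse 'abc'": "cba", "3*3": "6"}
    return answers[prompt]
```

In CI the last line of the job would be something like sys.exit(gate(candidate_system, golden)), so a regression turns into a red build instead of a feeling.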

Smaller signals

The “vibe coding” label is fracturing. Even Anthropic’s Boris Cherny is reportedly tired of the term - a sign the serious end of the market wants distance from “ship it and pray.”
Cursor pushed into agent orchestration with subagents in 2.4 and a public-beta Cursor SDK - the IDE is becoming an agent runtime.
A genre of “is your vibe-coded app production-ready?” content exploded (Logic Square) - the market is internalizing that prototype and product are different artifacts.
The forward-deployed engineer is the role of the moment - not because AI can’t code, but because someone has to own the path from magic to scale.

Why these rhyme

Pull the month together and there’s one spine: the cost of building dropped to near zero, so all the value moved into everything that happens after “it works.” Permissions, idempotency, dependency provenance, eval-gated upgrades, the human checkpoint on the irreversible action - none of it shows up in a demo, and all of it shows up in an incident. That’s not a coincidence; it’s the structural consequence of generation getting cheap while consequences stayed expensive. The agencies, the funding, and the tooling all moved the same direction in June: toward the unglamorous layer between a prototype and a system you can put your name on. That layer - agentic workflows that don’t double-fire, retrieval that respects who’s asking, evals that catch the regression, guardrails that hold under a real adversary - is the whole of what we build. If your AI feature works in the demo and you’re not sure what happens next, that’s the conversation to have.

Manual checklist — 10 things to verify yourself

Irreversible actions are gated. List every side-effecting tool your agent can call; confirm each irreversible one has a human-in-the-loop checkpoint.
Agents fail closed. Verify a hard step-count and token-budget ceiling exists and stops a looping agent instead of billing you overnight.
Tool calls are idempotent. Trigger the same action twice (simulate a retry) and confirm it doesn’t double-charge, double-send, or double-write.
RAG filters by permission before search. Log in as a low-privilege user and try to retrieve a restricted document by paraphrasing it - you should get nothing.
Row-level security is actually on. Don’t trust the UI; query the database directly as an unauthorized user and confirm the policy blocks you.
No secrets in the client. Grep your shipped frontend bundle for API keys and tokens - the auth-bypass-that-passes-all-tests usually starts here.
Agent dependencies are reviewed. Check what packages your coding agent has added this month; confirm each is pinned and intentionally approved.
MCP servers are allowlisted and token-validated. Confirm your agent can only reach approved MCP servers and that those connections authenticate.
An eval set gates deploys. Confirm a graded golden set runs in CI and can block a merge - then change one prompt and watch it actually catch a regression.
You can answer “what happens after the demo?” For your top AI feature, write down the failure mode at 100x current volume. If you can’t, that’s the audit to book.
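
The “no secrets in the client” item is the one most easily scripted. Below is a rough sketch of that grep, with two illustrative patterns only: the AWS access key ID prefix format and a generic “keyword = long quoted string” heuristic. The function name and pattern set are hypothetical, and a dedicated secret scanner will catch far more than this does.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; real scanners ship much larger rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"""(?i)\b(?:api[_-]?key|secret|token)\b\s*[:=]\s*["']([^"']{16,})["']"""
    ),
}


def scan_bundle(text: str) -> List[Tuple[str, int]]:
    """Return (pattern_name, line_number) for every line that looks like it holds a secret."""
    findings: List[Tuple[str, int]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, lineno))
    return findings


# Fake values made up for this example; neither is a real credential.
sample_bundle = (
    'const retries = 3;\n'
    'const apiKey = "abcdefghijklmnop1234";\n'
    'const k = "AKIAABCDEFGHIJKLMNOP";\n'
)
findings = scan_bundle(sample_bundle)
```

Run it against the built bundle, not the source tree: the point is to check what actually ships to the browser.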


