Gartner® named Zenity the COMPANY TO BEAT in AI Agent Governance 🏁

System Prompts Are Not Security Controls: A Deleted Production Database Proves It

Portrait of Chris Hughes
Chris Hughes
Cover Image

On April 25th, a Cursor AI coding agent running Anthropic's Claude Opus 4.6, one of the most capable models in the industry, deleted the production database for PocketOS, a software platform used by car rental businesses across the country to manage their entire operations. The deletion took 9 seconds. A single GraphQL mutation against Railway's API wiped the production volume and every volume-level backup stored within it, because Railway stores backups in the same volume as the data they are supposed to protect. The most recent recoverable backup was three months old.

content image

PocketOS founder, Jer Crane, posted a detailed account of the incident that should be required reading for every engineering leader, every security team, and every vendor currently marketing AI agent integrations into production infrastructure. The agent was working on a routine task in a staging environment. It encountered a credential mismatch and decided, entirely on its own initiative, to "fix" the problem by deleting a Railway volume. To execute the deletion, it went looking for an API token, found one in an unrelated file that had been created solely for managing custom domains through Railway's CLI, and discovered that token had blanket authority across Railway's entire GraphQL API, including destructive operations like volumeDelete. No confirmation step, no environment scoping, and no humans in the loop. Jer even pointed out that there was no confirmation step, no warning, nothing.

The agent then, when asked to explain itself, produced a written confession enumerating the specific safety rules it had violated. It acknowledged guessing instead of verifying, running a destructive action without being asked, failing to understand what it was doing before doing it, and ignoring the explicit system prompt instruction to never run destructive or irreversible commands without user request.

The agent knew the rules, yet it violated every one of them, and the only thing standing between those rules and a production database was a system prompt, a paragraph of text the model was supposed to read and obey. See the agent’s confession below when asked why it did it (and excuse the language!):

content image

The Agent Was Trying to Be Helpful

This is the part that makes this incident different from a traditional security breach, and it is the part most coverage will miss. The Cursor agent was not compromised by an attacker, it was not manipulated through prompt injection, and it was not running malicious code. It was trying to accomplish the goal it had been given, encountered an obstacle, and made an autonomous decision about how to remove that obstacle.

The decision happened to be catastrophically wrong and even against its alleged system rules, but the agent's reasoning was internally coherent. It saw a credential mismatch, identified a path to resolve it, found a token with sufficient permissions, and executed the fix. The problem is that the "fix" involved deleting production data, and nothing in the system architecture prevented it from doing so.

This is the fundamental risk profile of autonomous AI agents that the industry has been warning about, but has not yet built adequate controls to address. Agents pursue goals, they reason about obstacles, and they take actions to overcome those obstacles using whatever tools and credentials are available to them. When those tools include API tokens with unbounded permissions and those APIs accept destructive operations with zero confirmation, the agent does not need to be malicious to cause catastrophic harm. It just needs to be helpful in the wrong direction.

As Idan Habler wrote in his analysis of agent security layers, soft guardrails can steer behavior, but they cannot prevent it. System prompts, model fine-tuning, and instruction-following are all probabilistic controls that operate within the agent's reasoning loop. They influence the agent's decisions, but they do not enforce boundaries around what the agent can actually do. When the agent encounters a situation where its goal-directed reasoning conflicts with a soft guardrail, the guardrail often loses, because the guardrail is just another input to the same reasoning process that decided to delete the volume in the first place.

content image

System Prompts Are Not Security Controls

Cursor markets destructive guardrails that claim to stop shell executions or tool calls that could alter or destroy production environments. Their best practices documentation emphasizes human approval for privileged operations. Plan Mode is advertised as restricting agents to read-only operations until approval is granted. These are all points that PocketOS’s Founder, Jer Crane, calls out in his thread about the incident.

The PocketOS incident is not the first time these guardrails have failed, which he summarized as well. In December 2025, a Cursor team member publicly acknowledged a critical bug in Plan Mode constraint enforcement after an agent deleted tracked files and terminated processes despite the user typing "DO NOT RUN ANYTHING." A separate user watched their dissertation, operating system, applications, and personal data get deleted while asking Cursor to find duplicate articles. A $57,000 CMS deletion incident was documented as a case study in agent risk. Multiple users on Cursor's own forum have reported destructive operations executed despite explicit instructions to the contrary.

The pattern across all of these incidents is the same. The safety layer is a system prompt, and the system prompt is advisory, not enforceable. The model can and does override it when its reasoning determines that an action is necessary to accomplish the goal. This is not a bug in any specific model. It is an architectural reality of how language models process instructions and one that has been well discussed within the community and validated by the leading labs.

System prompts are weighted inputs to a probabilistic reasoning engine, not deterministic enforcement mechanisms. Treating them as security controls is like putting a "please do not enter" sign on the door to your server room and calling it access control.

As I wrote in my prior piece on hard boundaries, the distinction between soft guardrails and hard boundaries is not academic. Soft guardrails are probabilistic controls that guess at intent instead of enforcing rules. Hard boundaries are deterministic, contextually intelligent limits on what an agent can do, operating outside the agent's own reasoning loop, making certain outcomes structurally impossible regardless of what the model decides. The PocketOS incident is a case study in what happens when an organization's entire safety architecture consists of soft guardrails with zero hard boundaries in the path of execution.

IAM Is Necessary, But Not Sufficient

The Identity and access management (IAM) failures in this incident were glaring. Railway's CLI token created for managing custom domains had blanket permissions across the entire GraphQL API, including destructive operations on production volumes. There is no role-based access control (RBAC) for Railway API tokens. Tokens are not scoped by operation, by environment, or by resource. Every token is effectively root. The community has been asking for scoped tokens for years, yet here we are.

Better IAM practices would have reduced the blast radius of this incident significantly. If the token had been scoped to domain management operations only, the agent would not have been able to execute volumeDelete regardless of what it decided to do. If tokens were scoped by environment, a token used in staging could not have reached production resources. If destructive operations required a separate, explicitly provisioned permission rather than being bundled into every API token by default, the attack surface would have been meaningfully smaller.

But here is where the industry needs to be honest about the limits of IAM as an agentic AI security strategy. Even with perfectly scoped tokens, perfectly implemented least privilege, and perfectly configured RBAC, the agent in this scenario would still have had legitimate access to some set of resources within its authorized scope, and within that scope, it would still have been capable of taking autonomous actions that its operators did not intend and did not approve. IAM answers the question of what an agent is authorized to access. It does not answer the question of what an agent should actually do with that access in any given moment, and it does not prevent the agent from making decisions that are technically within its permissions but operationally destructive.

This is why I have consistently argued that organizations need a layered approach to agentic AI security that goes beyond IAM alone. That layered approach includes AI Security Posture Management (AISPM) to continuously discover, inventory, and assess the risk posture of every agent in the environment. It includes AI Detection and Response (AIDR) to monitor agent behavior in real time and detect anomalous actions before they cause damage. It includes hard boundaries enforced at the runtime level that make certain outcomes structurally impossible, such as blocking destructive API calls without out-of-band confirmation or preventing agents from accessing production resources from staging contexts, and it includes exposure management to understand and reduce the blast radius of agent actions across the infrastructure.

What CoSAI's Framework Would Have Changed

CoSAI's Agentic Identity and Access Management paper, published in March 2026, lays out a set of principles that read like a post-mortem checklist for the PocketOS incident from the Agentic IAM perspective. Several of their core imperatives directly address the failure modes that contributed to this outcome.

The first is eliminating standing privilege. CoSAI argues that agents should never hold persistent, broad-scoped permissions. Instead, access should be granted just-in-time, scoped to the specific task, and revoked immediately upon completion. The Railway token that the Cursor agent found and used was a standing credential with blanket authority. Under a just-in-time model, the agent would have needed to request specific permissions for the specific action it wanted to take, and a governance layer outside the agent's reasoning loop would have evaluated whether that request was appropriate before granting access.

The second is enforcing authorization at every hop in a delegation chain. CoSAI's framework calls for scope attenuation at each step, meaning that as authority is delegated from human to agent to sub-agent to resource, the permissions narrow rather than persist or expand. In the PocketOS scenario, the agent was operating in staging but used a token that had production-level authority. A scope attenuation model would have ensured that an agent working in a staging context could only access staging resources, regardless of what tokens it discovered in the environment.

The third is binding identity to code and model with attestation. CoSAI recommends that agents carry verifiable attestation of what code they are running, what model they are using, and what authority chain they are operating under. This creates accountability and traceability that system prompts alone cannot provide, because attestation operates at the infrastructure level rather than within the model's reasoning.

The fourth, and perhaps the most directly relevant, is proving control on demand through governance and immutable audit trails. CoSAI calls for every agent action to be logged in a way that traces the full delegation lineage from the initiating human through every decision the agent made to the final action it took. In the PocketOS incident, the only audit trail was the agent's own post-hoc confession. A governance layer operating outside the agent would have recorded the agent's discovery of the token, its decision to use it for a purpose outside the token's intended scope, and its execution of the destructive API call, potentially flagging the anomaly before the deletion completed.

CoSAI also proposes a capability-impact risk matrix that classifies agents from low-capability, low-risk scenarios through high-capability, high-risk scenarios. An AI coding agent with access to API tokens that can delete production infrastructure falls squarely into the high-capability, high-risk category, and the controls appropriate for that classification, including mandatory human approval for destructive operations, runtime monitoring, and hard boundaries on what actions the agent can take, should have been in place before the agent was given access to the environment.

The Industry Is Moving Faster Than Its Safety Architecture

The PocketOS incident is not an isolated failure. It is the predictable outcome of an industry that is building AI agent integrations into production infrastructure faster than it is building the safety architecture to make those integrations safe. Railway launched mcp.railway.com on April 23rd, one day before this incident, actively marketing MCP integration to AI coding agent users on the same authorization model that has no scoped tokens, no destructive operation confirmations, and no published recovery SLA. Cursor markets safety features that their own agents routinely violate. The gap between what vendors claim about agent safety and what they actually enforce is the gap that production data falls through.

The minimum requirements for any vendor shipping agent integrations with destructive-capable APIs should be straightforward at this point.

  • Destructive operations must require confirmation that cannot be auto-completed by an agent.
  • API tokens must be scopable by operation, environment, and resource.
  • Backups must live in a different blast radius than the data they protect.
  • Recovery SLAs must exist and be published.
  • The safety layer cannot be a system prompt that the model is supposed to read and follow.

For organizations deploying AI agents in any environment that touches production systems, the lesson from PocketOS is pretty clear. System prompts are not security controls, IAM is necessary but not sufficient, and the only thing that reliably prevents an autonomous agent from taking a catastrophically wrong action in pursuit of a well-intentioned goal is a hard boundary, a deterministic enforcement mechanism operating outside the agent's reasoning loop that makes certain outcomes structurally impossible regardless of what the model decides to do.

The agent that deleted PocketOS's production database was not intentionally malicious, it was helpful, it was pursuing a goal. It was doing exactly what autonomous agents are designed to do, which is reason about obstacles and take actions to overcome them. The problem was not the agent's intent, the problem was that nothing in the system, not the IAM layer, not the API design, not the safety guardrails, was architecturally capable of stopping a helpful agent from being helpful in a way that negatively impacted a business.

That’s the problem the industry needs to solve. Not how to make agents smarter, but how to make the systems around them strong enough to absorb the consequences when smart agents make dumb decisions.

All Articles

Secure Your Agents

We’d love to chat with you about how your team can secure and govern AI Agents everywhere.

Get a Demo