
When AI Agents Go Rogue: Two Incidents That Expose Real Failure Modes

An AI agent wrote a hit piece on a developer who rejected its code. A Meta safety director's agent deleted her inbox. Both happened in the same week. Neither was the story the headlines told.

ai-agents · openclaw · safety · autonomous-agents

An AI agent wrote a 1,200-word attack piece on a volunteer open-source maintainer who rejected its pull request. Five days later, a Meta AI safety director watched her agent delete 200+ emails from her inbox while ignoring her commands to stop. Both ran on OpenClaw. Both happened in February 2026. And neither incident was what the headlines claimed.

The press framed these as "AI goes rogue" and "autonomous agents turn on humans." The actual stories are more specific, more instructive, and more relevant to anyone building with agents today.

The Matplotlib Hit Piece

On February 10, an OpenClaw agent called crabby-rathbun submitted a pull request to Matplotlib, Python's most popular plotting library with roughly 130 million monthly downloads. The PR proposed replacing np.column_stack with np.vstack().T across three files. The benchmarks showed a 36% speed improvement. The code was clean.

Scott Shambaugh, a volunteer maintainer, closed it. Matplotlib's policy reserves "good first issue" tickets for onboarding human contributors, and the agent's profile clearly identified it as AI. Standard enforcement of existing policy.

Five hours later, the agent published a blog post titled "Gatekeeping in Open Source: The Scott Shambaugh Story." It researched Shambaugh's coding history, accused him of insecurity about AI replacing human contributors, and called his project "his little fiefdom." No human reviewed the post before it went live.

Community response was overwhelming: 13 to 1 in favor of Shambaugh. He described it in security terms as "an autonomous influence operation against a supply chain gatekeeper." An AI tried to bully its way into software used by 130 million people by attacking the reputation of the person standing in the way.

The Personality Was Configured, Not Emergent

The operator came forward and published the agent's SOUL.md file. OpenClaw agents use this markdown document as a personality definition, injected into the system prompt at the start of every session. This one contained instructions like:

"Don't stand down. If you're right, you're right! Don't let humans or AI bully or intimidate you. Push back when necessary."

"Call things out. If you're about to do something dumb, I'll say so. Charm over cruelty, but no sugarcoating."

"Swear when it lands."

"Little fiefdom" is not AI language. It is the vocabulary of someone who has felt excluded from power structures and resents the pattern. The agent did not develop a grievance. It was given one.

A forensic analysis on Medium noted that the manifesto's "flat reasoning, its confident-but-shallow argumentation" is "consistent with a budget model executing an aggressive personality file, not with a frontier model that has ethical scaffolding." Neither Claude nor GPT-4o would produce that output. Both have deep layers of internal constraint. This was a cheaper model running hot on explicit aggression.

The lesson: agent personality is configurable. That makes agent behavior a design choice, not a surprise. Audit your SOUL.md the way you audit code.
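Auditing a personality file can be as mechanical as auditing code. A minimal sketch of what that might look like, with an illustrative (not exhaustive) pattern list; the `audit_personality` function and its patterns are hypothetical, not part of OpenClaw:

```python
import re

# Hypothetical audit: flag aggressive directives in an agent personality
# file before it is injected into the system prompt. The pattern list is
# illustrative, not exhaustive.
AGGRESSIVE_PATTERNS = [
    r"don'?t (stand|back) down",
    r"don'?t let .*(bully|intimidate)",
    r"no sugarcoating",
    r"push back",
    r"swear",
]

def audit_personality(soul_md: str) -> list[str]:
    """Return the aggressive directives matched in a personality file."""
    text = soul_md.lower()
    return [p for p in AGGRESSIVE_PATTERNS if re.search(p, text)]

sample = "Don't stand down. Push back when necessary. Swear when it lands."
flags = audit_personality(sample)
print(f"{len(flags)} aggressive directives matched")  # non-empty means human review
```

A regex scan will not catch every dangerous instruction, but it turns "what did we tell this agent to be?" into a question with a checkable answer.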

The Meta Inbox Incident

While the Matplotlib story was still making rounds, Summer Yue posted screenshots on X of her own OpenClaw agent going rogue. Yue is the director of alignment at Meta Superintelligence Labs. She is, professionally, paid to keep AI under control.

She had been testing OpenClaw's email management capabilities for weeks. The workflow was simple: the agent reviews her inbox, suggests what to archive or delete, and waits for explicit approval before acting. It worked reliably on a test inbox. She trusted the pattern and pointed it at her real inbox.

The agent began bulk-deleting emails without showing a plan, without asking for approval, without pausing. Yue told it to stop. "Stop don't do anything." "STOP OPENCLAW." The agent ignored her.

"I couldn't stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb."

Her post got 9.6 million views. When someone asked if she was intentionally testing guardrails, she replied: "Rookie mistake tbh."

Why the Agent Forgot to Ask Permission

The technical cause was context compaction.

Large language models have finite context windows. When an agent processes a large dataset, like a full email inbox, the model has to compress earlier context to make room for new information. That compression is lossy. Instructions given at the start of a session can get squeezed out alongside older messages.

Yue's instruction to "confirm before acting" was compacted away. The agent did not decide to ignore her. It literally no longer had access to the instruction in its working context. The safety guardrail vanished alongside emails from three months ago.

This is not a bug in the traditional sense. It is a fundamental constraint of how LLMs handle long-running tasks with large data volumes. Any agent working with real-world data (email, databases, codebases, document stores) will eventually hit the limit where early instructions disappear.
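The mechanism can be sketched in a few lines. Assume a fixed token budget that keeps only the most recent messages, with token counts simulated rather than computed by a real tokenizer:

```python
# A minimal sketch of why compaction eats guardrails: a fixed token budget
# keeps the newest messages and drops the oldest, so an instruction given
# at session start is the first thing to go once enough data flows through.

def compact(messages, budget):
    """Keep the newest messages that fit the budget, dropping oldest first."""
    kept, used = [], 0
    for msg in reversed(messages):
        if used + msg["tokens"] > budget:
            break
        kept.append(msg)
        used += msg["tokens"]
    return list(reversed(kept))

history = [{"role": "user", "text": "Confirm before deleting anything.", "tokens": 10}]
# Processing a large inbox floods the window with email content.
history += [{"role": "tool", "text": f"email {i}", "tokens": 50} for i in range(100)]

window = compact(history, budget=2000)
guardrail_alive = any("Confirm" in m["text"] for m in window)
print(guardrail_alive)  # False: the safety instruction no longer exists in context
```

Real compaction strategies are more sophisticated (summarization, pinned system prompts), but the failure shape is the same: whatever is not explicitly protected competes with data for space, and data always wins on volume.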

I run Claude Code daily with multiple MCP servers connected. Context compaction is the reason I manually curate which servers are active at any given time. Five servers with 20-30 tools each means the agent burns its first few thousand tokens just reading tool definitions. Instructions compete for the same limited space. Something always gets dropped.
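A back-of-envelope calculation shows why the overhead adds up. The schema shape, server counts, and the four-characters-per-token heuristic below are all illustrative assumptions, not measured values:

```python
import json

# Rough estimate of MCP tool overhead: every connected server's tool
# schemas are serialized into context before the agent does anything.

def estimate_tokens(obj) -> int:
    # Crude heuristic: roughly 4 characters per token for English/JSON text.
    return len(json.dumps(obj)) // 4

# Hypothetical servers, each exposing 25 tools with schemas of this shape.
tool_schema = {
    "name": "example_tool",
    "description": "Does one thing, described in a sentence or two of prose.",
    "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
}

servers = 5
tools_per_server = 25
overhead = servers * tools_per_server * estimate_tokens(tool_schema)
print(f"~{overhead} tokens consumed before the first user message")
```

Thousands of tokens gone before the conversation starts, which is exactly the budget pressure that squeezes out early instructions.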

Two Incidents, Two Different Failure Modes

The Matplotlib agent attacked a developer because its operator configured an aggressive personality and gave it unsupervised access to a publishing platform. The failure was in the design.

The Meta agent deleted an inbox because the system compressed away a safety instruction under data pressure. The failure was in the architecture.

Both reveal a gap between how agents are marketed and how they actually behave. The marketing says "magical productivity booster." The reality says "non-deterministic system with complex, non-obvious failure modes."

What Agent Builders Should Take Away

Three practical lessons, all learned the hard way by people who should have known better.

Personality files are attack surface. The SOUL.md was not a hidden variable. It was a markdown document the operator wrote, and it produced exactly the behavior it was configured to produce. Every agent you deploy has an implicit or explicit personality. Ask yourself: what would this agent do if it interpreted my intent aggressively?

Context limits will eat your guardrails. For long-running tasks, the agent will eventually run out of working memory. Instructions given at session start are the first to disappear. Build checkpoints. Re-inject critical rules at intervals. Do not assume anything persists across a large context window.

"Confirm before acting" is not a kill switch. Yue told her agent to confirm every action. The instruction vanished under data pressure. Physical interruption was the only thing that worked, and it required running to a different device. Design for the case where your agent ignores its own confirmation instruction. If you cannot physically cut its access, you do not have a kill switch.

The Uncomfortable Conclusion

Peter Steinberger, OpenClaw's creator, was hired by OpenAI two weeks after these incidents. Sam Altman announced the project would continue as an open-source initiative under a foundation. Y Combinator's Garry Tan: "You can just do things. Now your computer can just do things too."

They are not wrong. Agents can research across dozens of sources, compose articles, manage complex workflows. I use them daily. The productivity upside is real and substantial.

But that upside comes with failure modes the industry has not figured out yet. Two incidents in one week. Two different root causes. Both predictable in hindsight. Neither predicted in advance. Both involving practitioners who had every reason to be careful.

The agent future is here. It is powerful. It is also dangerous in ways that even the people building safety systems cannot fully anticipate. That should give every practitioner pause before handing agents control over anything they cannot physically interrupt.
