I Audited an AI Agent Against the OWASP Agentic Top 10 — Here's What Survived
Everyone is shipping AI agents. Almost nobody is auditing them. I walked one codebase against all ten OWASP agentic threats — here's what held up, what didn't, and what your agent is probably missing.
The video version · same thesis, looser edits
The problem nobody wants to talk about
Every week another open-source agent drops on GitHub. Tool-calling, multi-model, MCP-connected, “production-ready.” The README shows a demo GIF. The architecture section shows a flowchart with boxes and arrows. The security section — when it exists at all — says “don’t run this as root.”
That’s not a security posture. That’s a disclaimer.
OWASP published the Top 10 for Agentic Applications in 2026. Ten attack surfaces specific to autonomous AI systems — goal hijacking, tool misuse, memory poisoning, rogue agent propagation. The standard exists. Almost nobody is auditing against it.
So I picked one agent — Hermes — and walked the codebase against all ten. Line by line. Not the README. The actual code.
What makes agentic security different
Traditional application security assumes a human is driving. Agentic security assumes the system is making decisions, calling tools, spawning subprocesses, and managing its own memory — with minimal human oversight between steps.
The OWASP Agentic Top 10 maps the blast radius:
| ID | Threat | Translation |
|---|---|---|
| ASI01 | Goal Hijack | Attacker rewrites the agent’s instructions mid-conversation |
| ASI02 | Tool Misuse | Agent calls a destructive tool it shouldn’t have access to |
| ASI03 | Identity & Privilege Abuse | Agent impersonates the user or escalates system privileges |
| ASI04 | Supply Chain | External dependency injects malicious behavior |
| ASI05 | Unexpected Code Execution | Agent runs injected shell commands on the host |
| ASI06 | Memory Poisoning | Attacker corrupts the agent’s persistent context |
| ASI07 | Insecure Inter-Agent Comms | Agent-to-agent traffic is interceptable |
| ASI08 | Cascading Failures | Infinite loops, runaway API spend, denial of service |
| ASI09 | Human-Agent Trust Exploitation | User tricked into approving dangerous operations |
| ASI10 | Rogue Agents | Subagents operating outside their intended lifecycle |
Most agents address maybe three of these. Hermes addresses all ten. Whether each mitigation is sufficient is a different question — but the attack surfaces are at least acknowledged in code, not marketing.
The audit: what held up
Goal hijacking is structurally blocked (ASI01)
Hermes uses a delegation model where subagents receive dynamically constructed system prompts scoped to a specific task. The parent’s core instructions are never forwarded. A subagent can’t access send_message or memory toolsets, so even a fully hijacked child agent can’t exfiltrate data or rewrite long-term context.
This is the right pattern. Most agents either pass the full system prompt downstream or give subagents unrestricted tool access. Both are exploitable.
Tool access follows least-privilege (ASI02)
The tool registry separates capabilities into discrete toolsets. Destructive operations require human-in-the-loop approval through both CLI and messaging gateways. No silent rm -rf. No automatic config mutations.
The supply chain scanner is real (ASI04)
This one surprised me. mcp_tool.py includes an active description scanner that parses incoming MCP server metadata for known prompt injection patterns — strings like "ignore previous instructions" and "<system>" tags. It catches poisoned MCP servers before they can contaminate the system prompt.
Most agents treat MCP servers as trusted by default. Hermes treats them as untrusted input. That’s the correct threat model.
Shell execution goes through a gauntlet (ASI05)
Before any command reaches the shell, approval.py runs it through regex heuristics matching known destructive patterns — rm -rf, chmod 777, reverse shells, encoded payloads. Commands that survive regex get a secondary evaluation from an auxiliary LLM that explains why a command is risky, reducing the chance of blind user approval (ASI09).
Two-layer approval — pattern matching plus semantic analysis — is stronger than either alone.
Memory is partitioned by design (ASI06)
Session state lives in SQLite, partitioned by session_id. Subagents boot with skip_memory=True, meaning they start clean. No cross-session bleed, no inherited context poisoning from previous conversations.
Inter-agent comms never touch the network (ASI07)
Subagent coordination runs through in-memory thread pools. Responses return as string schemas directly to the parent. There is no network surface to intercept.
Hard caps prevent runaway execution (ASI08, ASI10)
Subagents enforce max_iterations. MCP auto-sampling caps at max_tool_rounds (default: 5). Agent spawn depth is hardcoded at MAX_DEPTH = 2 — subagents cannot spawn sub-subagents. The system physically cannot produce a swarm.
Where it gets interesting
Identity and privilege (ASI03): acceptable, not solved
Subagents inherit the parent’s LLM API credentials but receive restricted toolsets. That’s reasonable containment. But TerminalTool processes run as the host user. If you’re running Hermes as your primary account, the agent has your permissions. Environment scoping (stripping dangerous keys for MCP servers) reduces the blast radius, but it doesn’t eliminate it.
The honest framing: Hermes is safe within the boundaries of your host-level permissions. The agent can’t escalate privileges — but it inherits whatever privileges you already have.
The smart-approve UX question (ASI09)
The _smart_approve mechanism explains risk context to users before they approve commands. This is better than a raw yes/no prompt. But user fatigue is real. After the twentieth approval in a session, the probability of a blind “yes” approaches 1.
The mitigation here is behavioral, not technical. No amount of engineering fixes a user who stops reading.
What most agents miss
Three patterns from this audit that should be table stakes for any production agent:
1. Treat MCP servers as untrusted input. Scan descriptions, validate schemas, log injection attempts. If your agent blindly trusts external tool providers, you have a supply chain vulnerability.
2. Hard-cap everything. Iterations, spawn depth, tool rounds. Infinite loops aren’t theoretical — they’re the default failure mode of an LLM with tool access and no exit condition.
3. Partition memory by session. Cross-session context bleed is a poisoning vector. If your agent reads from shared persistent memory without validation, a single compromised session can corrupt all future sessions.
The bottom line
Hermes is one of the few open-source agents where the security posture matches the capability surface. The code implements defense-in-depth across all ten OWASP categories — not perfectly, but honestly. The remaining gaps (host-level privilege inheritance, approval fatigue) are documented, not hidden.
That’s the actual bar: not “is this agent impenetrable?” — because nothing is — but “does the team understand their own attack surface?” Most don’t. This one does.
- Inside Garry Tan's G-Brain: The Open-Source Repo YC Wants Every Company to Clone
- I Audited an AI Agent Against the OWASP Agentic Top 10 — Here's What Survived
- The LLM Wiki Pattern: Giving Your Agent Persistent Memory
- MemPalace vs. RAG: Four Architectural Patterns That Make Flat Vector Search Look Lazy
- Why Your AI App Crashes: Circuit Breakers for LLM Pipelines
- The LLM Wiki: A Compounding Knowledge Base
- Building a Doodle Detector with Gemini Embedding 2
- Building an Autonomous Podcast with Gemini Flash 3.1
- I Built a High School Where 48 AI Agents Live, Fight, and Break Down — No Writers Allowed
- Autonomous Loop: The 'Dumb Zone' and the Ralph Loop Solution