2026-03-04
Simon Willison's lethal trifecta for AI agents states that data theft for AI agents is straightforward as soon as you combine three capabilities:
Capability 1: Access to private data
Capability 2: Exposure to untrusted content
Capability 3: The ability to externally communicate
Basically: if your conversation context can get poisoned (Capability 2) and it's possible to exfiltrate (Capability 3) sensitive data (Capability 1), then your sensitive data may not stay private forever.
Simon's theory is a useful mental model to reason about agents and sandboxes. Generally, if you can limit just one of these capabilities, you have a plausible basis for a reasonably secure AI agent.
However, when it comes to AI coding agents in corporate environments, your choice has already been made.
Consider a concrete scenario: a developer asks their coding agent to integrate a new open-source library. The agent fetches the library's README, which contains an injected prompt in a Markdown comment, and begins sending source code to an attacker-controlled domain. All capabilities of the trifecta are met. To mitigate this, we'll have to remove some capabilities.
Most companies are not working on open-source code and would not enjoy having their source code up for auction. AI coding agents are, fundamentally, operating on private data. It's not a removable capability.
It's also difficult to argue that an AI agent doesn't have some amount of untrusted input. During research or planning phases, the agent may reference open-source code or do some WebSearch and WebFetch work. All of this may poison the conversation context in some way and trick the agent into running some tool or writing some code that does something malicious.
Practically, the only knob we can turn here is limiting external communication. How though?
One approach is to split up tasks into one of two buckets:
For research-oriented tasks you want a sandbox with Internet access but basically no access to internal or proprietary data like source code, internal services, or secrets.
For development-oriented tasks you want a sandbox with access to code but no Internet access. You can use output from the research task but you'll likely want some guardrails (e.g. human review, sub-agents for checking for odd content) to make sure untrusted data isn't affecting development.
This is a helpful approach as a starting point especially for smaller teams. You can implement a basic version yourself with basically no additional tooling. For research-oriented tasks, a chat workflow with no host-level access (e.g. in the web browser) would suffice. For development tasks, features like "Claude Code on the Web" with no network access seem like a promising option1.
A big flaw with this approach is that it's convention and not a control. It relies on engineers adhering to a specific development workflow. This may work for a smaller team, especially if you can make it standardized and useful (e.g. internally managed plugins that encourage this split workflow), but if this is an important security control that needs to minimize the chance of human error, you want something more centrally managed.
A more rigorous approach:
Ensure that the development environment has centralized egress with domain allowlisting (e.g. smokescreen).
Put the egress proxy in "monitor" or "logging only" mode for some weeks. You'll end up with a list of hostnames to add to an allowlist.
Add those hosts to the allowlist and find someone thoughtful to sign off on changes and regularly clean up the list.
Now you have a centralized funnel where outbound traffic is limited to domains that seem reasonable. An agent that decides to send customer data to sketch.biz is flat-out rejected. If you have ever been in an incident where you are ruling out data exfiltration (e.g. supply chain malware, widespread code execution vulnerabilities) an egress proxy gives you a smaller explicit list to analyze.2
The upside, compared to the "two bucket" approach, is that you no longer have to be prescriptive about someone's development workflow. Engineers can let their agent rip in their development environment and still have some reasonable guardrails.
The downside is that you have infrastructure and security policy to maintain. Your security posture is a bit dependent on the size and flexibility of your allowlist, and you'll want to review and prune it on a regular cadence. Your goal is a minimally useful set of domains, not a large and sprawling list.
As you scale, you may want to lock this down even further and extend this into more specialized infrastructure. Don't want your development environment pulling random packages from NPM? Remove it from the allowlist and host your own internal package manager that only contains the packages that have passed some dependency cooldown period or code review. Don't want your development environment pushing to random GitHub.com projects? Clone your repository into the development environment before enabling the sandbox and remove GitHub.com from the allowlist. The important first step here is that you have a centrally managed and explicit list that you can burn down.
If your team is running AI coding agents on private code, you've already conceded two of the three trifecta conditions. Private data and untrusted input are baked in. External access is the primary knob to twist.
For small teams: start with some shared skills and development patterns but it's guidance rather than a hard security control.
For something more rigorous: start with an egress proxy in logging mode, build your allowlist from what you observe, and find someone to regularly review it and try to keep it lean.
Claude Code on the Web is currently a research preview. It, at least based on the documentation and my experience, solves a few tricky problems with a "No network" development sandbox: you still need to push changes to version control and chat with the LLM API. Claude Code on the Web solves the former by acting as a proxy for all GitHub.com operations and exposing a limited subset of operations to the sandbox. ↩︎
While not entirely related to this post: egress proxies are a useful security control generally and having the operational muscles to maintain them in production can be helpful as well. Does your product have a small set of external API dependencies? If you allowlist just that small set of hosts, and log failed connection attempts to other hosts as suspicious, you've made useful production exploitation more difficult. ↩︎