Retrieval-Augmented Generation (RAG) is a broad term for injecting additional context into a language model during a conversation. In practice, RAG extends a general-purpose LLM with data that is not present in its training data (e.g. proprietary documents, private internal knowledge, or information that post-dates the model’s training cutoff). This new data is often stored in a specialized database so the newly added information can be queried by the LLM or agent. This lets you turn a general-purpose LLM into a more tailored one without burning money on GPUs for model training.
Once you get past a few layers of jargon, the core idea behind a RAG feature is useful and simple: you transparently add context to a conversation with an LLM based on the user's messages. Despite their conceptual simplicity, however, RAG features in agents are not free of security risk. Issues can arise from how that context is sourced, filtered, and attached to user conversations.
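That core loop can be sketched in a few lines. This is a toy illustration, not a real retrieval implementation: the `retrieve` and `build_prompt` functions and the keyword-overlap "ranking" are all stand-ins for a real embedding-based vector search.

```python
# Minimal sketch of the core RAG loop (all names are hypothetical):
# retrieve context relevant to the user's message, then transparently
# attach it to the prompt before calling the model.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(user_message: str, documents: list[str]) -> str:
    """Attach retrieved context to the conversation on the user's behalf."""
    context = retrieve(user_message, documents)
    return "Context:\n" + "\n".join(context) + f"\n\nUser: {user_message}"

docs = [
    "The VPN client requires version 5.2 or later.",
    "Office hours are 9am to 5pm on weekdays.",
    "Password resets are handled by the helpdesk portal.",
]
print(build_prompt("how do I reset my password", docs))
```

Note that the user never sees the retrieval step happen, which is exactly why the sourcing and filtering of that context matters so much.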
This writeup outlines a general threat model that I use when thinking about RAG features in agents.
A challenging risk in RAG features is misplaced trust in the added data. Additional data may involve content from large and unorganized private data sources or data sourced directly from the Internet. That material may be factually incorrect, outdated, or intentionally crafted to execute "indirect" prompt injection attacks. In effect, the model could be reasoning over untrusted inputs even if the end-user is well-intentioned.
This could lead to users being misled or misinformed. In cases where tool calls are available to an agent, this could also lead to exfiltration of sensitive user information.
At the time of this writing, these attacks cannot be reliably prevented, but they can be mitigated.
Collecting and maintaining source labels for data and surfacing those labels to users can improve response quality and correctness in the long-term. Source information can enable users to make informed decisions based on RAG/LLM output instead of necessarily trusting the model. These labels could also be used to identify sources that are likely to be unreliable, untrustworthy, or biased and remove them from future use.
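One way to make that concrete is to carry a source label alongside every retrieved chunk and surface it in output. The sketch below assumes a simple `Chunk` structure and a source denylist; both are illustrative, not a real API.

```python
# Sketch of carrying source labels alongside retrieved chunks so users can
# judge provenance, and so sources later judged unreliable can be dropped.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str        # e.g. "internal-wiki", "public-web"
    retrieved_at: str  # when the data was collected

# Sources the operator has decided not to trust for future answers.
BLOCKED_SOURCES = {"public-web"}

def usable_chunks(chunks: list[Chunk]) -> list[Chunk]:
    """Filter out chunks from sources that have been removed from use."""
    return [c for c in chunks if c.source not in BLOCKED_SOURCES]

def render_with_labels(chunks: list[Chunk]) -> str:
    """Surface the source label next to each piece of context shown to the user."""
    return "\n".join(f"[{c.source}, {c.retrieved_at}] {c.text}" for c in chunks)

chunks = [
    Chunk("Debug mode is enabled via the support console.", "internal-wiki", "2024-11-02"),
    Chunk("Totally real fact from a random forum.", "public-web", "2024-11-03"),
]
print(render_with_labels(usable_chunks(chunks)))
```

The labels serve both audiences at once: users can weigh the answer against its provenance, and operators can retire bad sources without retraining anything.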
It is also plausible that as frontier models improve, they may become less susceptible to prompt injection attacks. For example, the Opus 4.5 model release marketing material highlighted improvements in handling adversarial input. However, just letting the model sort it out isn't a mitigation that I would feel comfortable trusting yet. The underlying research demonstrates that prompt injection is not yet a solved problem.
Another major risk is query authorization. A RAG feature may rely on a queryable vector store or knowledge base that serves many users with different trust levels. An obvious trust boundary would be between users from different organizations. For example, if Organization A uploads its internal wiki, those documents should not be surfaced in conversations with users from Organization B.
A more subtle issue can arise within a single organization. An organization-wide document pool may include material that is restricted to specific teams, roles, or people. Without proper controls RAG queries could easily violate these internal boundaries.
While authorization models vary in capability across applications, the primary mitigation here is document-level query authorization based on trusted end-user identity (e.g. their session token).
This authorization control should be deterministic and not reliant on an LLM. Deriving user identity and authorization levels from chat context is likely to end in authorization flaws.
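A minimal sketch of that deterministic check might look like the following. The session lookup and result format are assumptions; the important property is that the allowed-group set comes from the trusted session, never from anything the model or the chat says.

```python
# Sketch of deterministic document-level authorization. The allowed-group
# set is derived from the trusted session token, never from chat content.
# In a real system this filter belongs inside the vector store query itself,
# not as post-filtering in application code.

def resolve_groups_from_session(session_token: str) -> set[str]:
    """Stand-in for a real session lookup (hypothetical data)."""
    sessions = {"token-abc": {"org-a", "support-team"}}
    return sessions.get(session_token, set())

def authorized_results(results: list[dict], session_token: str) -> list[dict]:
    """A document is visible only if the user holds at least one of its groups."""
    groups = resolve_groups_from_session(session_token)
    return [r for r in results if groups & set(r["allowed_groups"])]

results = [
    {"doc": "Org A internal wiki page", "allowed_groups": ["org-a"]},
    {"doc": "Org B incident report", "allowed_groups": ["org-b"]},
]
visible = authorized_results(results, "token-abc")
print([r["doc"] for r in visible])
```

Filtering inside the store query (rather than after retrieval) also avoids a subtle failure mode where unauthorized documents influence ranking or summarization before being dropped.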
When an application ingests heaps of internal documents, it's plausible those documents contain unexpected and sensitive information. For example, a RAG pipeline consuming internal equipment manuals may not recognize that those documents describe internal support team workflows, such as enabling an internal-only debug mode.
While this could be framed as an unpreventable end-user mistake, the application still bears some responsibility for enabling an effective response if an issue is noticed.
What is achievable is strong auditability. Detailed logs that capture which documents were used in which conversations, and by which users, can be critical for incident response. Additionally, retroactive conversation redaction and document removal can further reduce blast radius after an issue is discovered.
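As a sketch, an audit log only needs to answer one question quickly during incident response: given a document that turned out to be sensitive, which conversations and users received it? The structures below are illustrative assumptions.

```python
# Sketch of retrieval audit logging for incident response: record which
# documents were attached to which conversation and for which user, so a
# leaked document can be traced to every conversation it touched.

audit_log: list[dict] = []

def record_retrieval(conversation_id: str, user_id: str, doc_ids: list[str]) -> None:
    """Log every document attached to a conversation at retrieval time."""
    audit_log.append({
        "conversation": conversation_id,
        "user": user_id,
        "documents": doc_ids,
    })

def conversations_touched_by(doc_id: str) -> set[str]:
    """Incident response: every conversation that received a given document."""
    return {e["conversation"] for e in audit_log if doc_id in e["documents"]}

record_retrieval("conv-1", "alice", ["manual-7", "wiki-3"])
record_retrieval("conv-2", "bob", ["wiki-3"])
print(sorted(conversations_touched_by("wiki-3")))
```

With this in place, removing a document from the store and redacting the conversations it appeared in becomes a lookup rather than a forensic exercise.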
Finally, there is the risk of leaking system context. Some messages (such as the system prompt and context added by the RAG) may not be directly exposed to the user and are therefore incorrectly assumed to be hidden. System prompt extraction is regularly demonstrated even against advanced models. Even if a system prompt itself is non-sensitive, RAG features may attach additional context that users are not meant to see directly. From the model’s perspective, this context is no different than any other input and may be exposed with clever prompting.
A practical defensive stance is to assume that any context added to a conversation is user-accessible. In more complex systems, it may be necessary to introduce separate trust boundaries, such as isolating summarization or filtering steps from raw document retrieval based on the user’s query.
RAG features solve a useful problem but still require a bit of thoughtful secure design work. The threat model above should be a good starting point, but it will likely need tailoring to your specific architecture and design.