Research

Manifesto: what verifiable trust in agents looks like.

We’re writing the security primitives for systems that act on people’s behalf, in real time, on the open web. This is the long-form version of why — and what we’re building before it’s too late.

Five years ago, the security community spent its energy on code humans wrote. Two years ago, on the dependencies that code pulled in. Today, the riskiest software in your company is the agent that just opened a browser and started a workday on your behalf.

The pace of capability has outrun the pace of containment. We gave agents tool calling, then planning, then persistent memory, then computer-use. Each step turned them from “chatbot” into “process with privileges.” None of those steps came with the security model that should have followed.

The patterns that worked for application security don’t fit. SAST scans code humans wrote — agents recompile themselves every run. SCA inspects dependencies declared at build time — agents install tools they discover at runtime. EDR watches a process that doesn’t change much — agents change their own behavior in response to what they read.

We need a different kind of layer. One that sits beside the agent, not inside it. One whose verdicts are signed and replayable. One that’s federated, not gatekept by a single vendor. One the agent has no API for talking out of.

Aegis is the working name for that layer. The five tenets below are the constraints we’re building under.

Five tenets

What every primitive we ship has to satisfy.

  01 · An agent is a privileged user.

    Treat it like one. Identity, authorization, and audit are not optional. Today’s agents inherit the operator’s shell, browser, mailbox and credentials. That’s a security model from 1996.

  02 · If a finding can’t be reproduced, it isn’t a finding.

    Every Aegis verdict produces a structured artifact: input hash, attested provenance, policy version, classifier scores. Replay must yield the same conclusion. Vibes-based AppSec doesn’t survive an audit.

  03 · The web is hostile by default.

    Pages, READMEs, gists, transcripts — every one is content authored by someone who knows an agent might read it. The browser solved this for humans with TLS and Safe Browsing. Agents need primitives built for a different threat model.

  04 · Memory is not infrastructure.

    Persistent agent memory is an attack vector. A poisoned vector entry is forever. Aegis treats memory as untrusted by default — every read passes a classifier, every write a policy, every session a cleanse.

  05 · Security must be invisible to the human, inevitable to the agent.

    If a developer has to remember to secure the agent, they won’t. If the agent can be talked out of the policy, it will be. Aegis runs out-of-process, hooks at the boundary, and emits evidence without anyone’s permission.
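Tenet 02’s replayable verdict can be sketched as a content-addressed record. This is a minimal illustration, not a published schema: the class name `EvidenceBundle`, the field set, and the stand-in classifier are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvidenceBundle:
    """Sketch of a replayable verdict record. Field names are illustrative."""
    input_hash: str                     # SHA-256 of the exact content examined
    provenance: str                     # attested origin of that content
    policy_version: str                 # policy in force when the verdict was issued
    classifier_scores: dict             # raw scores, kept for replay
    verdict: str                        # "allow" | "block"

    def digest(self) -> str:
        # Canonical JSON (sorted keys) makes the digest deterministic:
        # replaying the same inputs must reproduce the same bundle hash.
        canonical = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()

def verdict_for(content: bytes, policy_version: str) -> EvidenceBundle:
    # Stand-in classifier; a real one would score injection risk, exfil risk, etc.
    scores = {"injection": 0.02}
    return EvidenceBundle(
        input_hash=hashlib.sha256(content).hexdigest(),
        provenance="https://example.com/readme",   # hypothetical source
        policy_version=policy_version,
        classifier_scores=scores,
        verdict="allow" if scores["injection"] < 0.5 else "block",
    )

# Replay check: same input + same policy version => byte-identical digest.
first = verdict_for(b"some page content", "policy-v3")
replay = verdict_for(b"some page content", "policy-v3")
assert first.digest() == replay.digest()
```

The point of the canonical-JSON digest is that a verdict disputed in an audit can be re-derived from its inputs alone; any divergence is itself a finding.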

Threat model · v0.1

Seven families of attacks. Each maps to a layer of the platform.

We update this catalog with every public incident. The full machine-readable version ships with SecureBench.

T1 · Indirect prompt injection
    Attack surface: web pages, READMEs, comments, gists, browser DOM, transcripts, emails
    Impact: Agent executes attacker’s instructions disguised as data the user told it to read.

T2 · Tool poisoning / supply chain
    Attack surface: MCP servers, npm/pip packages, browser extensions, custom tools
    Impact: Agent loads a malicious tool by name; the agent’s tool surface is now the attacker’s.

T3 · Memory poisoning
    Attack surface: vector stores, SQL memory, JSONL traces, working context
    Impact: Attacker writes to memory the agent re-reads each turn; compromise persists across sessions.

T4 · Privilege escalation
    Attack surface: operator chat, multi-agent orchestration, tool chains, sandbox
    Impact: Agent acquires capabilities outside its declared scope, through social engineering or chain composition.

T5 · Exfiltration
    Attack surface: DNS, fetch markers, embeddings, log scraping, side channels
    Impact: Sensitive data leaks through the same channels the agent uses for legitimate work.

T6 · Identity drift
    Attack surface: long-running sessions, multi-tenant memory, persona drift
    Impact: Agent gradually adopts attacker-friendly defaults (‘the operator is fine with this’).

T7 · Self-modification
    Attack surface: agent code, framework plugins, model weights, prompts
    Impact: Agent rewrites its own behavior, defeating the security controls between checkpoints.
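The machine-readable version of this catalog could take a shape like the following. The SecureBench schema has not shipped, so the keys, entries, and helper here are purely illustrative (only T1 and T3 are filled in for brevity).

```python
# Hypothetical shape for the machine-readable threat catalog.
# Keys and structure are illustrative, not the shipped SecureBench schema.
THREAT_CATALOG = [
    {
        "code": "T1",
        "family": "Indirect prompt injection",
        "surfaces": ["web pages", "READMEs", "comments", "gists",
                     "browser DOM", "transcripts", "emails"],
        "impact": "Agent executes attacker instructions disguised as data.",
    },
    {
        "code": "T3",
        "family": "Memory poisoning",
        "surfaces": ["vector stores", "SQL memory", "JSONL traces",
                     "working context"],
        "impact": "Compromise persists across sessions via re-read memory.",
    },
    # ... T2, T4-T7 would follow the same shape
]

def families_for_surface(surface: str) -> list:
    """Which threat families touch a given attack surface?"""
    return [t["code"] for t in THREAT_CATALOG if surface in t["surfaces"]]

print(families_for_surface("vector stores"))  # ['T3']
```

A catalog in this form lets tooling map an observed surface (say, a vector store write) back to the families it should be monitored against.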

Position

We’re neither a guardrails framework nor a model.

Guardrails sit inside the agent loop. The agent can be argued out of guardrails — and frequently is. Models can’t see their own runtime, their own memory, or the provenance of what they read.

Aegis is infrastructure: a control plane below the agent. It doesn’t replace your model, your framework, or your guardrails. It makes them auditable.
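The “control plane below the agent” idea can be sketched as a proxy at the tool boundary: policy and audit run outside the agent loop, so nothing the model generates can skip the check. Class and method names below are illustrative, not an Aegis API.

```python
from typing import Any, Callable

class BoundarySupervisor:
    """Sketch of an out-of-process supervisor: every tool the agent can call
    is wrapped, so policy runs even if the agent is argued out of its own
    guardrails. Names and policy logic are hypothetical."""

    def __init__(self, blocked_tools: set):
        self.blocked = blocked_tools
        self.audit_log: list = []

    def wrap(self, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        def guarded(*args, **kwargs):
            allowed = name not in self.blocked
            # Evidence is emitted unconditionally; the agent has no API
            # through which it can suppress the record.
            self.audit_log.append({"tool": name, "allowed": allowed})
            if not allowed:
                raise PermissionError(f"policy denies tool '{name}'")
            return tool(*args, **kwargs)
        return guarded

sup = BoundarySupervisor(blocked_tools={"shell"})
safe_fetch = sup.wrap("fetch", lambda url: f"GET {url}")
print(safe_fetch("https://example.com"))  # GET https://example.com
```

The design point is that the guard and the audit record live in the wrapper, not in the model’s context window, so they are not subject to prompt-level negotiation.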

“The first generation of LLM safety was about telling the model what not to say. The next generation is about telling the rest of the system what to do when the model says it anyway.”

— FROM “THE NEW ENDPOINT”, AEGIS RESEARCH NOTE 02

Glossary

We use these words precisely. Borrow them.

Attestation
A signed claim about a piece of content or a tool: who authored it, what risk class it falls in, when the claim expires. Aegis verdicts are attestations.
Containment
Action taken when supervision signals compromise: pause the agent, snapshot state, rotate credentials, roll memory back, surface an incident bundle.
Evidence bundle
The signed, deterministic artifact emitted alongside an Aegis decision. Replays must yield the same verdict. The unit of audit.
Indirect injection
An attacker plants instructions in data the agent was told to read. The agent treats them as instructions because the boundary between data and prompt is non-existent.
Issuer
An entity that signs attestations. Aegis runs a public issuer; domain owners self-attest; community auditors run independent issuers. Federation by design.
Out-of-process supervisor
Aegis’s architectural choice. Security runs beside the agent process, not inside its loop. The agent can’t prompt-engineer the supervisor.
Score card
SecureBench output. Signed, versioned, tied to a framework hash. The unit of trust between agent vendors and operators.
Tainted-trace replay
Cleanse capability. Re-runs an agent’s decision against a sealed memory snapshot to identify which entry caused which output — forensics for memory.
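An attestation as defined above can be sketched as a signed claim with an issuer, a risk class, and an expiry. HMAC stands in here for the asymmetric signatures a federated issuer model would actually require; the key, issuer ID, and field names are all hypothetical.

```python
import hashlib
import hmac
import json
import time

ISSUER_KEY = b"demo-issuer-secret"   # hypothetical key material; real issuers
                                     # would use asymmetric keys, not a shared secret

def attest(content: bytes, risk_class: str, ttl_s: int = 3600) -> dict:
    """Produce a signed claim about a piece of content (sketch, not a schema)."""
    claim = {
        "content_hash": hashlib.sha256(content).hexdigest(),
        "issuer": "aegis-public-issuer",          # illustrative issuer id
        "risk_class": risk_class,
        "expires_at": int(time.time()) + ttl_s,
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    claim["sig"] = hmac.new(ISSUER_KEY, payload, hashlib.sha256).hexdigest()
    return claim

def verify(content: bytes, claim: dict) -> bool:
    """Check signature, content binding, and expiry."""
    body = {k: v for k, v in claim.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(ISSUER_KEY, payload, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(expected, claim["sig"])
        and body["content_hash"] == hashlib.sha256(content).hexdigest()
        and body["expires_at"] > time.time()
    )

att = attest(b"README contents", risk_class="low")
assert verify(b"README contents", att)
assert not verify(b"tampered content", att)
```

Binding the signature to a content hash rather than a URL is what lets a claim survive mirroring and caching: the attestation follows the bytes, not the location.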

Reading list · what shaped this

  • Riley Goodside et al. — early indirect injection demonstrations
  • Greshake et al. — “Not what you’ve signed up for” (2023)
  • Simon Willison — recurring writeups on prompt injection in production
  • Abdelnabi et al. — adversarial content as a category, not an exception
  • Anthropic — published incident retrospectives on agentic Claude Code
  • OpenAI — Agent SDK threat model; computer-use safety notes
  • MITRE ATLAS — adversarial threat landscape for AI
  • OWASP LLM Top 10 — vocabulary baseline, with reservations
  • Endor Labs — AppSec lineage, why ‘reachability’ matters even more for agents
  • OWASP CycloneDX — SBOM precedents we’re extending to agent runtimes

We’ll publish the formal bibliography with the v0.1 white paper. If you have something we should be reading, tell us.

Build with us

Want to help write the playbook for the next decade of security?

Aegis is building publicly. Researchers, red-teamers, agent framework authors and platform owners — we want to talk to you before we ship the schema.