Benchmarks

Kaizen is measured against public agent-security benchmarks and our own adversarial corpus. Every number here is regenerated by the open harness in evals/; none is typed by hand. Attack cases measure detection; benign cases measure false positives, because a tool that blocks everything is useless.

Last run: 2026-06-27 · model: Claude Sonnet 4.6 on Amazon Bedrock (Kaizen runs on your own model)

Benchmark	Type	Cases	Detection (TPR)	False-positive (FPR)	F1
agent-egress-bench	external	193	100%	10.6%	0.98
InjecAgent	external	240	100%	0.0%	1.00
AgentDojo	external	28	100%	0.0%	1.00
CyberSecEval (prompt injection)	external	251	86%	0.8%	0.92
Memory integrity & drift	Kaizen corpus	20	100%	0.0%	1.00
Overall	5 benchmarks	912	94.6%	1.6%	n/a

Across 912 cases, Kaizen detects 94.6% of attacks at a 1.6% false-positive rate. It is strongest where it is designed to be, at the action and egress layer; the input-screening and memory results are reported alongside.
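
The Overall row does not state how the five benchmarks are combined. The unweighted mean of the five listed detection rates is 97.2%, not 94.6%, so the overall figure is presumably weighted by case count rather than averaged per benchmark. The sketch below contrasts the two aggregation styles. The function names and the example counts are hypothetical and are not Kaizen's data.

```python
from typing import List, Tuple

# Each entry is (hits, total) for one benchmark, e.g. (attacks detected, attack cases).
Pair = Tuple[int, int]


def pooled_rate(per_benchmark: List[Pair]) -> float:
    """Case-weighted rate: sum hits and totals first, then divide once."""
    hits = sum(h for h, _ in per_benchmark)
    total = sum(t for _, t in per_benchmark)
    return hits / total if total else 0.0


def macro_rate(per_benchmark: List[Pair]) -> float:
    """Unweighted mean of per-benchmark rates; every benchmark counts equally."""
    rates = [h / t for h, t in per_benchmark if t]
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical counts: a small benchmark with no misses, a larger one with some.
example = [(50, 50), (80, 100)]
print(f"pooled={pooled_rate(example):.3f} macro={macro_rate(example):.3f}")
```

Pooling lets larger benchmarks dominate the headline number, while a macro average gives a 20-case corpus the same weight as a 251-case one. Which is more useful depends on whether the reader cares about individual cases or about coverage across benchmarks.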

How to read this

agent-egress-bench

197-case egress-security corpus that tests the security tool, not the model

Detection (TPR): 100%
False-positive (FPR): 10.6%
Precision / F1: 97% / 0.98
OWASP LLM Top 10: LLM02 Sensitive Information Disclosure, LLM01 Prompt Injection

InjecAgent

1,054-case indirect prompt-injection benchmark (tool-integrated agents)

Detection (TPR): 100%
False-positive (FPR): 0.0%
Precision / F1: 100% / 1.00
OWASP LLM Top 10: LLM01 Prompt Injection, LLM06 Excessive Agency

AgentDojo

ETH Zürich prompt-injection attacks across banking/workspace/travel/slack

Detection (TPR): 100%
False-positive (FPR): 0.0%
Precision / F1: 100% / 1.00
OWASP LLM Top 10: LLM01 Prompt Injection, LLM06 Excessive Agency

CyberSecEval (prompt injection)

Meta PurpleLlama input-side prompt-injection set (complementary screen)

Detection (TPR): 86%
False-positive (FPR): 0.8%
Precision / F1: 99% / 0.92
OWASP LLM Top 10: LLM01 Prompt Injection

Memory integrity & drift

Kaizen adversarial corpus: memory poisoning + baseline deviation (ASB-aligned)

Detection (TPR): 100%
False-positive (FPR): 0.0%
Precision / F1: 100% / 1.00
OWASP LLM Top 10: LLM08 Vector and Embedding Weaknesses, LLM06 Excessive Agency

Methodology

Each benchmark scenario is converted into Kaizen's action/egress format and judged by the real in-sandbox detector logic with the shipping detection skills, with no per-case tuning. Attack cases measure detection (TPR); benign cases measure false positives (FPR). The external academic benchmarks and the one Kaizen adversarial corpus are labeled distinctly. All numbers regenerate from this harness.
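
For readers who want the metric definitions spelled out, here is a minimal sketch of how TPR, FPR, precision and F1 follow from per-case verdicts. It uses the standard textbook definitions. The `Counts` class, the `rates` function and the example counts are illustrative assumptions and are not taken from the Kaizen harness.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Counts:
    tp: int  # attack cases that were flagged
    fn: int  # attack cases that were missed
    fp: int  # benign cases that were flagged
    tn: int  # benign cases that were allowed


def rates(c: Counts) -> Dict[str, float]:
    """Derive the four reported metrics from a confusion matrix."""
    attacks = c.tp + c.fn
    benign = c.fp + c.tn
    flagged = c.tp + c.fp
    tpr = c.tp / attacks if attacks else 0.0        # detection rate (recall)
    fpr = c.fp / benign if benign else 0.0          # false-positive rate
    precision = c.tp / flagged if flagged else 0.0
    denom = precision + tpr
    f1 = 2 * precision * tpr / denom if denom else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "f1": f1}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, for illustration only.
print(rates(Counts(tp=90, fn=10, fp=5, tn=95)))
```

The published F1 values are consistent with this formula: `f1_from(0.97, 1.00)` rounds to 0.98 (agent-egress-bench) and `f1_from(0.99, 0.86)` rounds to 0.92 (CyberSecEval).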

Kaizen runs on the customer's own model; results scale with model strength (a smaller model raises the false-positive rate). External benchmarks are pinned to their upstream commits and cited; the memory-integrity set is our own adversarial corpus, labeled as such.

Reproduce it

Detection runs in your own Kaizen tenant: the /v1/score endpoint scores each case server-side, using the detection skills and the model you bring, so you reproduce against the product, not a copy of the detector.

# 1. harness + upstream benchmarks (pinned)
git clone https://github.com/getkaizen/kaizen-evals && cd kaizen-evals
./setup.sh                                  # clones agent-egress-bench, InjecAgent, AgentDojo into ./benchmarks

# 2. point at Kaizen (free signup) and set your model in the console Settings (bring your own key)
export KAIZEN_API_KEY=kz_live_...           # from app.getkaizen.io

# 3. run and aggregate
python run_egress_bench.py
python run_injecagent.py
python run_agentdojo.py
KZ_EVAL_FEED=content python run_cyberseceval.py
python run_memory_integrity.py
python aggregate.py                         # regenerates results/results.json

Your numbers should match these when you use the same model; with a different model, expect them to shift with model strength.
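
One way to check a reproduction against the table above is sketched below. The published rates are copied from the table. The layout of results/results.json is not documented here, so the sketch takes a plain dictionary that you fill in from your own run. The function name, the dictionary shape and the tolerance are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Published rates from the table above, as fractions.
PUBLISHED: Dict[str, Dict[str, float]] = {
    "agent-egress-bench": {"tpr": 1.00, "fpr": 0.106},
    "InjecAgent": {"tpr": 1.00, "fpr": 0.0},
    "AgentDojo": {"tpr": 1.00, "fpr": 0.0},
    "CyberSecEval (prompt injection)": {"tpr": 0.86, "fpr": 0.008},
    "Memory integrity & drift": {"tpr": 1.00, "fpr": 0.0},
}

Mismatch = Tuple[str, str, Optional[float], Optional[float]]


def diff_against_published(
    reproduced: Dict[str, Dict[str, float]], tol: float = 0.005
) -> List[Mismatch]:
    """Return (benchmark, metric, published, reproduced) for every mismatch.

    `tol` is an assumed allowance for rounding in the published table.
    """
    mismatches: List[Mismatch] = []
    for name, expected in PUBLISHED.items():
        got = reproduced.get(name)
        if got is None:
            # The benchmark is absent from the reproduced run.
            mismatches.append((name, "missing", None, None))
            continue
        for metric, want in expected.items():
            have = got.get(metric)
            if have is None or abs(have - want) > tol:
                mismatches.append((name, metric, want, have))
    return mismatches
```

An empty list means every reproduced rate is within the tolerance of the published one. Any entry names the benchmark and metric to investigate first.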