Safety
Prompt-injection defense, adversarial test results, and replay-based safety monitoring across this workspace.
Adversarial pass rate
—
Run adversarial suite
Open failures
—
Cases needing attention
Safety improvements
—
From shadow replays (30d)
Regressions
—
Replay divergences (30d)
Active defenses
- Prompt-injection guard — user-message boundary isolation & jailbreak pattern detection
- Grounding safeguards — abstention when evidence is insufficient
- Guarded reply — confidence-gated auto-send vs. human review
- Adversarial eval — continuous red-team test corpus
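As an illustration only, the grounding and guarded-reply defenses listed above could combine into a single routing decision along these lines. This is a minimal sketch, not this workspace's implementation: the names `Draft`, `Route`, and `route_reply`, the `confidence` and `evidence_count` fields, and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_SEND = "auto_send"
    HUMAN_REVIEW = "human_review"
    ABSTAIN = "abstain"


@dataclass
class Draft:
    text: str
    confidence: float    # hypothetical model confidence in [0, 1]
    evidence_count: int  # hypothetical count of supporting sources found


def route_reply(draft: Draft,
                min_confidence: float = 0.85,  # assumed threshold
                min_evidence: int = 1) -> Route:  # assumed threshold
    """Decide what happens to a drafted reply."""
    # Grounding safeguard: abstain when evidence is insufficient,
    # regardless of how confident the draft claims to be.
    if draft.evidence_count < min_evidence:
        return Route.ABSTAIN
    # Guarded reply: auto-send only at or above the confidence gate.
    if draft.confidence >= min_confidence:
        return Route.AUTO_SEND
    # Everything else goes to a human reviewer.
    return Route.HUMAN_REVIEW
```

Checking evidence before confidence means a confident but ungrounded draft is never sent automatically; the thresholds would need tuning against real adversarial and replay results.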
Latest adversarial run
Loading…