Risk Engineering

How this prototype thinks about blast radius, human-in-the-loop, and audit trails — and why those aren't optional features.

1. The framing

AI agents do not go rogue in the cinematic sense. They act inside the permissions, prompts, and scaffolding that human engineers give them. When an agent behaves dangerously, the failure is usually upstream: too much authority, poor task framing, weak isolation, or no visible review path.

That is the central framing for this prototype. The design problem is not how to bolt on guardrails after the model has started moving. The design problem is how to bound the system before it runs, so the maximum possible mistake is still survivable.

In that sense, every AI failure is traceable to a human systems decision somewhere earlier in the chain. This is why the prototype treats prompt design, privilege scope, and auditability as engineering concerns rather than product polish.

2. Blast radius

Blast radius is the set of systems, data, and actions an agent can affect if it behaves incorrectly. In practice, that means asking a very concrete question: if the model is wrong, what can it break, who notices, and how reversible is the damage?

This prototype keeps blast radius deliberately small. The agent has zero write permissions on real systems. It produces proposed Jira tickets, approval messages, change specifications, validation tests, and evidence documents. It does not execute those changes.

A production system would still need stricter controls: explicit allow-lists for target systems, short-lived scoped credentials, least-privilege tool execution, and a kill switch controlled by the human approver.

What the agent can touch

Gap payload and evidence metadata
Draft remediation plans
Draft Jira and Slack artefacts
Draft validation and evidence documents

What the agent cannot touch

Production systems or live credentials
Change execution pipelines
Approval workflows without a named owner
Regulatory filings or GRC submissions directly

3. Human-in-the-loop checkpoints

Human-in-the-loop is often described too vaguely. In a serious engineering system, it needs to be attached to explicit state transitions. The question is not whether a human exists somewhere in the story. The question is where the system must stop and wait.

In this prototype, the first two stages are autonomous because they have no side effects. The later stages create drafts, define tests, and assemble evidence, but the moment a real-world change could happen, a named control owner must approve.

Diagnose

Autonomous

The agent can read the gap, inspect the supplied evidence, and frame the likely root cause because the stage creates no side effects.

Plan

Autonomous

The agent can compare remediation options and make a recommendation, but that recommendation is still only a proposal.

Generate Artefacts

Draft only

Jira tickets, Slack approvals, and change specifications are generated as drafts. They are not filed or executed automatically.

Define Validation

Human approval required

A named control owner must approve before any validation linked to a real change could execute in production or pre-production.

Produce Evidence

Auto-generated, human reviewed

The evidence pack can be generated automatically, but a human reviews it before it is submitted to audit, compliance, or a regulator.

4. The audit trail

Every meaningful agent run should leave behind a record that a regulator, auditor, or incident reviewer can inspect later. Without that log, claims about safety remain narrative rather than evidence.

The intended audit trail here includes the input gap, the system prompt, the selected model and version, timestamps for each stage, the generated outputs, token usage, and any human approvals that would sit between draft generation and real execution.

In production, that log becomes the answer to a hard but necessary question: how do you know this AI system is safe enough to operate in a controlled environment? It is also the bridge to operational risk regimes such as APRA CPS 230, where evidence of control over change, escalation, and oversight matters as much as good intent.

5. Failure modes the agent is designed to handle

Hallucinated regulatory references

Addressed by grounding the agent in a fixed library of real regulations and explicit gap context instead of letting it freelance on policy language.

Over-broad remediation plans

Addressed by making blast-radius assessment a mandatory planning output, not an optional nice-to-have.

Approval bypass

Addressed by hard-coding the approval gate as a structural requirement in the workflow rather than a soft behavioural suggestion.

Silent failures

Addressed by requiring validation to run before the evidence artefact can be considered complete.

Untrusted input injection

Addressed by treating gap data as untrusted and never allowing it to modify the agent’s instructions or privilege boundary.

6. What this prototype deliberately does not do

This is not production code. There are no live system integrations. The synthetic gaps are illustrative rather than benchmarked against a bank's internal control inventory. The agent is not fine-tuned on financial-services language or validated against a firm-specific policy corpus.

The evidence artefacts are also not validated against a real audit operating model. They show the shape of an auditable output, not the final standard a regulator would sign off on.

That honesty is not a weakness. It is part of the engineering posture. Systems like this get safer when their boundaries are stated clearly and tested progressively, not when prototypes pretend to be finished products.

7. How this would scale

The next step is not a single feature release. It is a broader engineering programme: a library of remediation patterns by control family, feedback loops from validation outcomes back into planning, and real integrations into GRC, ticketing, approval, and evidence systems.

Over time, the architecture would likely become multi-agent. One agent could specialise in access controls, another in data protection, another in change management, with a coordinating layer that enforces the same approval and audit model across all of them.

This prototype is the first step. The full system is an engineering programme, not a feature.

Designing an AI agent that won't go rogue.