Tag: ai-observability

  • Agentic AI Drift: The Silent Production Risk No One Is Measuring

    When maddaisy examined the agentic AI governance gap last week, the focus was on a structural mismatch: roughly three-quarters of enterprises plan to deploy agentic AI, yet only about one in five has a mature governance model. That gap remains wide. But a more specific — and arguably more dangerous — operational risk is now coming into focus: agentic AI systems do not fail suddenly. They drift.

    A recent analysis published by CIO makes the case plainly. Unlike earlier generations of AI, which tend to produce identifiable errors — a wrong classification, a hallucinated fact — agentic systems degrade gradually. Their behaviour evolves incrementally as models are updated, prompts are refined, tools are added, and execution paths adapt to real-world conditions. For long stretches, everything appears fine. KPIs hold. No alarms fire. But underneath, the system’s risk posture has already shifted.

    The Problem with Demo-Driven Confidence

    Most organisations still evaluate agentic AI the way they evaluate any software feature: through demonstrations, curated test scenarios, and human judgment of output quality. In controlled settings, this looks adequate. Prompts are fresh, tools are stable, edge cases are avoided, and execution paths are short and predictable.

    Production is different. Prompts evolve. Dependencies fail intermittently. Execution depth varies. New behaviours emerge over time. Research from Stanford and Harvard has examined why many agentic systems perform convincingly in demonstrations but struggle under sustained real-world use — a gap that grows wider the longer a system runs.

    The result is a pattern that will be familiar to anyone who has managed complex software in production: a system passes all its review gates, earns early trust, and then becomes brittle or inconsistent months later, without any single change that clearly broke it. The difference with agentic AI is that the degradation is harder to detect, because the system’s outputs can still look reasonable even as the reasoning behind them has shifted.

    What Drift Actually Looks Like

    The CIO analysis includes a telling case study from a credit adjudication pilot. An agent designed to support high-risk lending decisions initially ran an income verification step consistently before producing recommendations. Over time, a series of small, individually reasonable changes — prompt adjustments for efficiency, a new tool for an edge case, a model upgrade, tweaked retry logic — caused the verification step to be skipped in 20 to 30 per cent of cases.

    No single run produced an obviously wrong result. Reviewers often agreed with the recommendations. But the way the agent arrived at those recommendations had fundamentally changed. In a credit context, that difference carries real financial and regulatory consequences.

    This is the nature of agentic drift: it is not a bug. It is the predictable outcome of complex, adaptive systems operating in changing environments. Two executions of the same agent with the same inputs can legitimately differ — that stochasticity is inherent to how modern agentic systems work. But it also means that point-in-time evaluation, one-off tests, and spot checks are structurally insufficient for production risk management.
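    The check the case study implies can be sketched in a few lines: given execution traces that record which steps an agent actually ran, compare the observed rate of a required step against a trusted baseline rate. The trace format, the `income_verification` step name, and the thresholds here are illustrative assumptions, not details from the article; a real system would pull this data from its tracing backend.

    ```python
    # Minimal sketch: detect when a required step is being skipped more often
    # than a trusted baseline allows. The trace format is hypothetical.

    def step_rate(traces, step):
        """Fraction of runs whose recorded steps include `step`."""
        hits = sum(1 for t in traces if step in t["steps"])
        return hits / len(traces)

    def skip_alert(traces, step, baseline_rate=1.0, tolerance=0.05):
        """Return (alert, observed_rate). The alert fires when the step's
        execution rate falls more than `tolerance` below the baseline."""
        observed = step_rate(traces, step)
        return (baseline_rate - observed) > tolerance, observed

    # Example: 100 recent runs, with income verification skipped in 25 of them
    # (comparable to the 20-30 per cent range in the case study).
    recent = [{"steps": ["fetch_application", "income_verification", "recommend"]}] * 75 \
           + [{"steps": ["fetch_application", "recommend"]}] * 25

    alert, rate = skip_alert(recent, "income_verification")
    # alert is True even though every individual run still produced a
    # plausible-looking recommendation.
    ```

    The point of the sketch is that none of the 25 deviant runs is individually alarming; only the aggregate rate against a baseline reveals the shift.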

    From Policy to Diagnostics

    When maddaisy covered the shadow AI governance challenge earlier this month, one theme was clear: governance frameworks are necessary but not sufficient. They define ownership, policies, escalation paths, and controls. What they often lack is an operational mechanism to answer a deceptively simple question: has the agent’s behaviour actually changed?

    Without that evidence, governance operates in the dark. Policy defines what should happen. Diagnostics establish what is actually happening. When measurement is absent, controls develop blind spots in precisely the live systems where agentic risk tends to accumulate.

    The Cloud Security Alliance has begun framing this as “cognitive degradation” — a systemic risk that emerges gradually rather than through sudden failure. Carnegie Mellon’s Software Engineering Institute has similarly emphasised the need for continuous testing and evaluation discipline in complex AI-enabled systems, drawing parallels to how other high-risk software domains manage operational risk.

    What Practitioners Should Watch For

    The emerging consensus points toward several operational principles for managing agentic drift:

    Behavioural baselines over output checks. No single execution is representative. What matters is how behaviour shows up across repeated runs under similar conditions. Organisations need to establish baselines — not for what an agent should do in the abstract, but for how it has actually behaved under known conditions — and then monitor for sustained deviations.
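    A behavioural baseline of this kind can be sketched simply: record one behavioural metric per run (here, tool calls per execution — an assumed metric, not one named in the article) during a trusted period, then score later runs against that distribution.

    ```python
    import statistics

    # Sketch of a behavioural baseline. Metric choice, sample values, and the
    # 3-sigma threshold below are illustrative assumptions.

    def build_baseline(metric_samples):
        """Summarise a trusted period as (mean, population std deviation)."""
        return statistics.mean(metric_samples), statistics.pstdev(metric_samples)

    def deviation_score(baseline, recent_samples):
        """How many baseline standard deviations the recent mean has moved."""
        mean, std = baseline
        if std == 0:
            return 0.0 if statistics.mean(recent_samples) == mean else float("inf")
        return abs(statistics.mean(recent_samples) - mean) / std

    baseline = build_baseline([4, 5, 4, 6, 5, 5, 4, 5])   # known-good runs
    score = deviation_score(baseline, [8, 9, 7, 8])        # recent runs
    drifting = score > 3.0                                 # e.g. a 3-sigma rule
    ```

    The design choice to monitor a distribution rather than individual runs follows directly from the principle above: no single execution is representative, so the unit of comparison has to be the aggregate.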

    Separate configuration changes from behavioural evidence. Prompt updates, tool additions, and model upgrades are important signals, but they are not evidence of drift on their own. What matters is persistence: transient deviations are often noise in stochastic systems, while sustained behavioural shifts across time and conditions are where risk begins to emerge.
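    The persistence idea reduces to a simple filter: a deviation only counts as drift once it has held for several consecutive monitoring windows. The window flags and `min_windows` threshold below are illustrative, not from the article.

    ```python
    # Sketch: suppress transient noise by requiring a sustained run of
    # flagged monitoring windows before declaring drift.

    def sustained_drift(window_flags, min_windows=3):
        """True once `min_windows` consecutive windows have flagged deviation."""
        streak = 0
        for flagged in window_flags:
            streak = streak + 1 if flagged else 0
            if streak >= min_windows:
                return True
        return False

    transient = [True, False, True, False, False, True]   # noise: never sustained
    persistent = [False, True, True, True, True]          # a real behavioural shift
    ```

    In a stochastic system, this distinction is what keeps the monitoring signal actionable: isolated flagged windows are expected, while an unbroken streak is evidence worth escalating.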

    Treat agent behaviour as an operational signal. Internal audit teams are asking new questions about control and traceability. Regulators are paying closer attention to AI system behaviour. Platform teams are under growing pressure to demonstrate stability in live environments. “It looked fine in testing” is no longer a defensible operational posture, particularly in sectors — financial services, healthcare, compliance — where subtle behavioural changes carry real consequences.

    The Observability Gap

    This is, ultimately, the next chapter in the governance story maddaisy has been tracking. The first chapter — covered in the enforcement era analysis — was about moving from principles to rules. The second, examined through Deloitte’s enterprise data, was the gap between strategic confidence and operational readiness. This third chapter is more specific and more technical: the gap between having governance frameworks and having the observability infrastructure to make them work.

    The goal is not to eliminate drift. Drift is inevitable in adaptive systems. The goal is to detect it early — while it is still measurable, explainable, and correctable — rather than discovering it through incidents, audits, or post-mortems. Organisations that build this capability will be better positioned to deploy agentic AI at scale with confidence. Those that do not will continue to be surprised by systems that appeared stable, until they were not.

    For consultants advising on enterprise AI deployments, the implication is practical: governance reviews that stop at policy documentation are incomplete. The question to ask is not just whether a client has an AI governance framework, but whether they can tell you how their agents are behaving today compared to three months ago. If the answer is silence, that is where the work begins.