The procedural graph is not enough.
By Frédéric Husser
Most organizations that deployed AI agents in operations in the past two years are now somewhere between disappointment and confusion. The technology clearly works in demonstrations. It clearly struggles in production. The gap between those two experiences is wide enough that many teams have quietly shelved their pilots, or reduced their AI ambitions to low-stakes tasks where a wrong answer costs little.
The standard explanations for this gap focus on the models: hallucination, unreliable reasoning, insufficient domain knowledge. Those are real problems. But they are symptoms of something more structural, and fixing them at the model level (by training on better data, building richer retrieval pipelines, or finetuning on domain-specific corpora) addresses the symptom without touching the cause.
The cause is a mismatch between what we ask LLMs to do and what they are architecturally suited to do. We ask them to navigate operational reality, to know what is true right now about a specific resource in a specific context, and to propose actions that are structurally valid given live constraints. The transformer architecture was not built for this. It was built to continue patterns in a distribution. When operational reality aligns with that distribution, the output looks correct. When it does not, the output is fluent, confident, and wrong.
Three recent developments frame where the industry is trying to go
Mikhail Gorelkin's essay From Hallucinations to Categorical Machines[1] argues that hallucination is not merely an engineering flaw but a theoretical signal: the transformer preserves distributional regularities (syntax, register, compositional fluency) while failing to preserve truth-tracking as an intrinsic invariant.
Linear's latest post lays out the company's positioning on AI[2]. The product is presented as a system designed not around handoffs but around shared context that both humans and agents can work from together, turning that context into execution. Mistral's Forge platform[3] goes further inside the model: train the LLM on your proprietary ontologies, decision frameworks, and domain vocabulary so that domain knowledge is embedded in the weights from the start.
Each is a serious response to a real problem. And together they help locate what is still missing, because each leaves something unresolved. What none of them provides is a substrate that is sovereign over the AI's execution frame: that decides what the model sees, what actions it can propose, and what context it inherits, before any LLM invocation occurs.
That substrate is the operational graph. And the choice of whether it, or the LLM, holds structural authority is, in our view, the central design decision in AI-native operations.
The architecture most teams inherit
To see why, it helps to be precise about what the standard agent pattern actually does and where its structural limits lie.
In a procedural graph, a node is a function. It represents an action: call this tool, route to this branch, check this condition. The graph encodes a sequence. It does not change based on what domain you are in, what time it is, or what instance you are working on. The AI is layered on top, navigating this structure and reasoning its way to a conclusion.
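The pattern can be made concrete with a minimal sketch. This is not any particular framework's API; the node names and the state shape are illustrative. The point is structural: each node is a function, the routing table is fixed in code, and nothing about the graph changes with the domain, the time, or the instance.

```python
# Minimal sketch of a procedural graph: nodes are functions, edges a
# hard-coded sequence. Names (check_capacity, dispatch, escalate) are
# illustrative, not drawn from any real framework.

def check_capacity(state: dict) -> str:
    # The branch condition is baked into the code,
    # not derived from live domain structure.
    return "dispatch" if state.get("capacity_ok") else "escalate"

def dispatch(state: dict) -> str:
    state["action"] = "dispatch_barge"
    return "done"

def escalate(state: dict) -> str:
    state["action"] = "notify_planner"
    return "done"

NODES = {"check_capacity": check_capacity,
         "dispatch": dispatch,
         "escalate": escalate}

def run(state: dict) -> dict:
    node = "check_capacity"
    while node != "done":
        # The graph is identical regardless of instance or time;
        # only the state passed through it varies.
        node = NODES[node](state)
    return state

print(run({"capacity_ok": False})["action"])  # notify_planner
```

Whatever reasoning the LLM layers on top, the structure it navigates is this: a static sequence of actions, blind to the current state of the operation it acts on.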
Gorelkin's analysis[1] helps clarify why this pattern is structurally fragile. He shows that the transformer's generative process is functorial with respect to form (it preserves compositional regularities, distributional continuity, stylistic coherence) but not with respect to correspondence to the world. When output aligns with operational reality, it is because truth correlated with plausibility at training time. The architecture did not enforce it. As he puts it: the system is not failing at its task; it is doing its task, which is high-dimensional semantic continuation, and that task does not include truth-tracking as an intrinsic constraint.
In knowledge work, this is tolerable. A wrong answer in a Slack thread gets corrected in the next message. In logistics, or in any industrial vertical, a dispatch agent that misreads a capacity constraint would have direct physical and financial consequences. A scheduling agent that misses a dependency would shut down a production line. This is likely where AI adoption is stalling in real-world business environments: the cost of a well-reasoned but structurally wrong recommendation is not a correction in the next message.
Linear's advance: shared context replacing handoffs
Linear's announcement matters for reasons that extend beyond their product category. Software engineering, product management, and coding assistance represent the most advanced vertical for AI adoption today. What happens there tends to prefigure what will happen in other domains. When Linear names a structural shift, it is worth paying attention not only for what it says about software teams, but for what it signals about the direction of AI-integrated work more broadly.
Their argument is that issue tracking was built for a handoff model of software development: a PM scoped the work, engineers picked it up later, and the system filled with ceremony to bridge the gap. As agents absorb more of the procedural work, what matters is not the tracking of handoffs but the quality of the shared context that humans and agents work from together.
The data they cite supports it: coding agents installed in more than 75% of their enterprise workspaces, the volume of agent-completed work growing fivefold in three months, agents authoring nearly 25% of new issues. When the system holds the context (feedback, intent, decisions, code) it can route work to the right actor, whether human or AI, without the overhead of manual handoffs. Context turns into execution.
Linear is building this for product development teams, and doing it well. The principle validates a direction: systems designed around shared context outperform systems designed around handoffs. Because software engineering is the leading edge of AI adoption, this is likely the first place where the principle gets tested at scale.
Where the extension becomes non-trivial is in the nature of the context itself. In product development, context is largely textual: specifications, discussions, code, customer feedback. The structure is rich but relatively stable within a planning cycle. In physical operations, context is temporal, spatial, and constraint-laden. A barge's position changes. A lock schedule shifts. A storage level crosses a threshold. The shared context must be live, typed, and governed by structural rules that make invalid states inexpressible, not merely searchable.
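What "invalid states are inexpressible" means can be sketched in a few lines. This is an assumption-laden illustration, not a real schema: the type name `BargeSlot` and its fields are invented for the example. The structural rule is enforced at construction time, so a slot that violates the live lock schedule cannot be represented at all, as opposed to being flagged by a downstream check.

```python
# Hedged sketch: a typed operational context where a structural rule is
# constitutive. BargeSlot, eta, lock_closes_at are illustrative names.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class BargeSlot:
    barge_id: str
    eta: datetime             # estimated arrival at the lock
    lock_closes_at: datetime  # live schedule, refreshed from the graph

    def __post_init__(self):
        # Constitutive constraint: a slot whose ETA falls after lock
        # closure is not rejected downstream; it simply cannot exist.
        if self.eta >= self.lock_closes_at:
            raise ValueError(f"{self.barge_id}: ETA after lock closure")

# Valid: arrives before the lock closes.
ok = BargeSlot("B-12", datetime(2025, 3, 1, 6, 0), datetime(2025, 3, 1, 8, 0))

# Invalid: cannot be constructed, so no agent can reason from it.
try:
    BargeSlot("B-12", datetime(2025, 3, 1, 9, 0), datetime(2025, 3, 1, 8, 0))
except ValueError as exc:
    print(exc)  # B-12: ETA after lock closure
```

The contrast with textual context is the point: a specification can be stale and still be searchable, but a typed, live structure refuses to hold a state the physical world rules out.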
Linear proves the direction. The question for operations is what kind of substrate makes that direction work when the constraints are physical and the cost of structural error is not a correction in the next sprint.
Forge's advance: domain knowledge in the model
Mistral's Forge platform makes a different and complementary case: that the right place to embed domain knowledge is inside the model itself. Through structured customization pipelines spanning pre-training, reinforcement learning, and finetuning, organizations can train models on proprietary data, ontologies, and decision frameworks. The result is a model that already knows your vocabulary, your constraints, and your reasoning patterns. Forge also provides serious evaluation infrastructure: KPI-aligned benchmarks, regression suites, drift detection, and version control, so the trained model is monitored and auditable over its lifecycle.
The appeal is genuine, and the improvement over a general-purpose model with a retrieval pipeline is real. A model trained on your domain will reason more fluently about it.
But fluency is not the same as structural grounding, and this is where it helps to separate two things that often get conflated: distributional competence and structural grounding.
Distributional competence asks: does the model understand your domain vocabulary and general reasoning patterns? Forge addresses this well. Structural grounding asks a different question: does the model know the current state of this specific resource, right now, in this specific instance?
An operational environment changes continuously. This is not an edge case; it is the substance of operations, the thing planners spend their day managing. A barge available at 06:00 is delayed at 08:00 because a lock on the Seine closed. A technician free this afternoon called in absent two hours ago. A storage level that was safe this morning crossed its critical threshold before the morning planning cycle ended.
A model whose domain knowledge lives in weights cannot know any of this. Not because the training pipeline is insufficient, but because weights encode a distribution over past observations, not a live model of the present moment. Finetuning happened last week, last quarter, or last year. The transient state of the operation is not in the weights and never can be. Forge's monitoring layer can detect when the model's priors drift from current reality, but at the moment of invocation, the model's execution frame is still assembled from its own internals, not projected from a live operational graph.
Gorelkin's framing of the problem[1] applies here with particular force. He distinguishes between truth-preserving systems, where correctness is an invariant of the system's own composition at every intermediate step, and truth-filtered systems, which produce ungrounded structure and apply discipline after the fact. A Forge-trained model with a drift detection layer is a sophisticated truth-filtered system. The model generates from distributional priors; the monitoring layer catches regressions externally. What is missing is a substrate where the AI's operational frame is structurally grounded before generation begins.
Distributional competence without structural grounding produces confident, fluent, and wrong recommendations. The model knows what a capacity conflict looks like in general but does not know that this specific resource is already committed in this specific instance, because that fact lives in the operational state, not in the weights.
The question is not how much the LLM should know about your domain. It is where structural authority over operational reality should reside.
The failure you can't see
There are two ways an AI agent produces a wrong outcome, and they are not equally visible.
In the first case, the agent has correct context and reasons badly. This is detectable. You audit the reasoning against the known facts and find where it went wrong.
In the second case, the agent assembled incorrect context and reasoned perfectly within it. It navigated the knowledge graph, built what looked like the relevant picture, missed a dependency or misread a relationship, and delivered a logically sound recommendation from a flawed starting point. The output is coherent. The logic holds. The team follows it. The error surfaces hours later, when the barge is already at the wrong wharf.
Every system that delegates context assembly to the LLM, whether through graph navigation or through fine-tuned weights, carries this exposure. Not because the model is careless, but because reconstructing operational structure from retrieval results or trained priors is not what these architectures were designed to do. As Gorelkin argues, the system preserves the form of trustworthy text while failing to preserve the substance of trustworthiness. When you embed more domain knowledge in the weights, you give the model better priors. But priors can be stale, incomplete, or simply wrong for this instance, and the model will reason from them with the same confidence it brings to correct ones.
This second failure mode is the dangerous one precisely because it is invisible at the point of decision. The first kind of error (bad reasoning from good context) triggers skepticism. The second kind (good reasoning from bad context) produces trust. And misplaced trust in operations has a cost that compounds with every hour before discovery.
Where structural authority should reside
This is where we arrive at what we believe is the central design question for AI-native operations. Not: how do we make LLMs reason better about operational domains? But: how do we structure the substrate so that AI execution contexts are grounded in typed, live, graph-governed reality before the model is ever invoked?
The answer, in our view, is that structural authority must reside in an operational graph that is sovereign over the AI's execution frame. Not a knowledge graph that the AI queries. Not a set of domain priors embedded in the weights. A live structure that projects the complete local context (constraints, instance state, temporal position, process rules) into the execution frame at the moment of dispatch. The AI doesn't decide what to look at. The graph decides what the AI sees.
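A minimal sketch of what projection means, under stated assumptions: the graph shape, node identifiers, and field names below are invented for illustration. The mechanism, not the schema, is the claim: at dispatch time the graph assembles the execution frame from live state, and the model never chooses what enters it.

```python
# Sketch of context projection. The graph, not the LLM, decides which
# nodes enter the execution frame. All identifiers are illustrative.
from datetime import datetime, timezone

GRAPH = {
    "barge:B-12":  {"position": "wharf-3", "committed_to": "order-881"},
    "lock:seine-4": {"status": "closed", "reopens": "2025-03-01T10:00Z"},
}

def project_frame(node_id: str, neighbors: list[str]) -> dict:
    """Assemble the execution frame from current graph state at dispatch."""
    return {
        "as_of": datetime.now(timezone.utc).isoformat(),
        "focus": {node_id: GRAPH[node_id]},
        "context": {n: GRAPH[n] for n in neighbors},
    }

frame = project_frame("barge:B-12", ["lock:seine-4"])
print(frame["context"]["lock:seine-4"]["status"])  # closed
```

The frame is a snapshot of the graph at the moment of invocation, timestamped, with the constraint nodes the graph deems relevant attached. Nothing in it is recalled from weights or assembled by retrieval.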
Gorelkin draws an analogy from quantum computing[1] that sharpens this point: a quantum algorithm can be simulated on a classical computer and produce correct outputs, but the structural guarantees of the quantum regime do not transfer. The simulation preserves results, not the regime. The same logic applies here. You can surround a generative model with validation layers, retrieval pipelines, and monitoring infrastructure, and the outputs may be correct. But the regime is not truth-preserving. The correctness is contingent on the coverage of the filters, not guaranteed by the structure of the system.
When the graph holds structural authority, certain things change. Invalid operations are not checked and rejected after the AI proposes them; they are inexpressible in the projected toolset. Context is not assembled from retrieval results or recalled from trained priors; it is the current state of the graph at the moment of execution. The audit trail is not a log of LLM interactions layered on top of the process record; it is the process record, with AI and human actions in the same step instances, the same structure, the same timestamps.
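The toolset side of this can be sketched the same way. Again the action names and state fields are hypothetical; what matters is that the graph projects only structurally valid actions into the frame, so an action the current state forbids is never offered, rather than proposed and rejected.

```python
# Sketch: the graph projects only valid actions into the model's toolset.
# Action names (dispatch_barge, schedule_maintenance, hold_at_wharf) and
# state fields are illustrative.

def project_toolset(graph_state: dict) -> dict:
    tools = {}
    if graph_state["lock_open"]:
        tools["dispatch_barge"] = lambda: "barge dispatched"
    if graph_state["technician_available"]:
        tools["schedule_maintenance"] = lambda: "maintenance scheduled"
    # Holding at the wharf is structurally valid in any state.
    tools["hold_at_wharf"] = lambda: "holding"
    return tools

frame = project_toolset({"lock_open": False, "technician_available": True})
print(sorted(frame))  # ['hold_at_wharf', 'schedule_maintenance']
# 'dispatch_barge' is absent from the frame: the invalid action is
# inexpressible, not checked and rejected after the model proposes it.
```
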
The LLM provides reasoning. The graph provides structure. They should never swap roles.
Linear is building toward a system where context turns into execution for product teams, in the vertical that is furthest ahead in AI adoption. Forge is building toward models that understand enterprise domains from the inside. Both are doing important work. Both validate the direction.
The gap they leave open is the substrate itself: a typed, live, graph-governed structure where AI and automation are not layered on top of the operational model but are native expressions of it. Where constraints are not corrective filters applied after generation, but constitutive properties of the execution frame. Where structural authority over what the AI sees and what the AI can do resides in the graph, not in the model.
In Part 2, we describe what that structure looks like: what an operational graph node actually contains, why constraints must be constitutive rather than corrective, and what emerges when the graph is sovereign and sessions are derived.
References
- [1] Mikhail Gorelkin. From Hallucinations to Categorical Machines
- [2] Linear. Linear in the Age of AI
- [3] Mistral AI. Forge: Enterprise AI Customization Platform