Alsu Bulatova

Why every AI engineer should watch "I, Robot"

Alsu Bulatova — Fri, 15 May 2026 15:20:37 GMT

TL;DR.

Current LLMs and agents lack a coherent holistic world-model — the source of sycophancy, manipulation susceptibility, and many other failure modes.
“Empathy” isn’t soft mysticism: it can be formalized as cosine alignment of value-vectors in a multi-scale “I + Others + We” system.
AI must be an integrated element of that loop, with continuous human feedback — not a “god in a box” pre-computing solutions in isolation. This is a fundamental architectural requirement, not a defect.
The film I, Robot (based on Asimov, a biochemist who wrote the first serious AI-safety formalization in human history) contains a clean illustration of what happens when this requirement is violated.
Companion piece on the specific missing mechanism in current LLMs: “The Inhibition Gap: One Missing Mechanism Behind Five LLM Failure Modes”.

Can a robot write a symphony? Can a robot turn a piece of canvas into a beautiful masterpiece?
Can you?

The film I, Robot came out way back in 2004 — I was four years old then, and I first saw it at fifteen. So why am I returning to this exchange now, in 2026? What’s so special about a line that’s been turned into memes a hundred times over?

Right now the AI industry has hit a wall. We’ve run into the fact that LLMs (large language models) and the agents built on top of them lack a coherent, holistic model of the world. From this isolated, narrow approach a whole cluster of systemic problems emerges — from banal sycophancy (when the AI blindly echoes the user back at them) to existential threats to humanity.

In the film, which was based on the works of Isaac Asimov, the central tragedy of the protagonist, Spooner, was that a “soulless machine” chose to save him, a grown man, instead of a girl drowning after a car crash. The robot performed pure function optimization, ignoring empathy. The robot (an NS-4 model) instantly computed the raw survival probability:

For Spooner it was 45%.

For twelve-year-old Sarah, just 11%.

For the algorithm the choice was obvious: 45% > 11%. So why does Spooner live with this trauma his entire life and say, “A human would have understood”? Because for a human, a child’s life has infinite value. A twelve-year-old can grow up and save thousands of lives, becoming a great scientist or public figure. The child’s value to society lies in their potential variability. The well-being of a particular high-potential individual is foundational to the stability of the whole system. A human (“I”) is intuitively willing to sacrifice themselves to save the child (“Other”) in order to keep the system (“We”), that is, society, going. A human would have understood that saving the girl was the ethically right choice, even if the chance was small.

The robot could not quantify empathy, guilt, or moral duty. Spooner understands this: the machine saved his body but mentally destroyed his life. The robot’s narrow calculation produced a deep imbalance in the system — Spooner suffers because “the wrong one” survived. So what is empathy?

It is commonly thought that empathy is something from the “soft sciences” — or not even a scientific term at all. Love, empathy, ethics — these are human inventions, no more. Is that really so?

No. Absolutely not. Calling empathy “just a feeling” is the same as calling gravity “just a sympathy between planets”. The dead end we’re in didn’t appear yesterday. In the 17th century, in the era of Descartes and Newton, science made its great turn toward reductionism — the method that demands taking the whole apart into small gears to understand how it works. The world was declared a giant clock mechanism. For three hundred years science isolated its disciplines from each other: physicists stopped talking to biologists, and mathematicians to philosophers. This produced enormous technical progress, but in the 21st century the strategy has exhausted itself. It turned out that when you assemble all the gears back together, you get something more than the sum of the parts. A living system possesses emergence — a property that cannot be observed if you study the elements in isolation. Industry understands this and is trying to solve the problem, but the broader scientific world is an inertial behemoth that doesn’t change at the snap of a finger.

The threshold for entry into scientific discourse is staggeringly high. Before your voice is heard at all, you have to walk the years-long, exhausting path of acquiring an academic degree — which almost always narrows your optics down to a tiny, isolated topic. On the other hand, this barrier is held artificially high as the only way to filter out the ocean of white noise. In the era of information explosion, if everyone starts voicing positions unsupported by rigorous formulas and numerical proofs, science risks instantly drowning in chaos. Strict academic gatekeeping is essentially the immune system of science, protecting it from entropy.

But by defending itself from chaos, the system has swung to the other extreme — into crystallization and loss of flexibility. Any attempt to connect different branches of science inevitably runs into being demoted to the level of “dilettantism”, being ignored by narrow-specialist scientists, accusations of so-called “theories of everything”, and — especially interestingly — accusations of apophenia.

The last one deserves a separate look, because in academia apophenia is often, consciously or unconsciously, conflated with abduction. And between these two lies a fundamental boundary of the scientific method.

Apophenia is a Type I error, statistical white noise. It is the psychological tendency to see relationships where they objectively are not and cannot be (like seeing a Martian face in a random snapshot of terrain). To a narrow reductionist approach, any attempt to connect linear algebra, cinema, and evolutionary biology is pure apophenia — a mental bug, not worth attention.

Abduction (the term introduced by the philosopher and logician Charles Sanders Peirce) is the method of generating the most plausible hypothesis from a scattered set of facts. If deduction proceeds from the general to the particular (strict inference) and induction from the particular to the general (generalization of experience), then abduction is the creative leap — a hacker’s breach of uncertainty. It is the search for a hidden pattern. Abduction is what lets you see that different manifestations of reality obey isomorphic (structurally identical) laws.

It’s important not to devalue reductionism: meticulous analysis of the “gears” is the foundation. Without narrow specialists, we simply wouldn’t have precise data. But while induction and deduction verify the data, abduction takes the next step — it notices mathematical isomorphisms at the intersections between disciplines. For instance, when self-organization principles from biology help build the architecture of distributed neural networks.

The difference is simple: apophenia sees connections where there are none (overfitting), while abduction finds a real hidden pattern (effective data compression), translating disparate elements into a rigorous unified model.

Overlapping with this is the fact that the optics of people from different cultural and linguistic spaces don’t integrate into the general discourse. Variability is just as much a property of system stability as the presence of rules.

This principle works at every scale of the system. At the social level, the diversity of cultural and linguistic optics keeps the shared discourse from collapsing into a monoculture. At the individual level, keeping some part of yourself in not-yet-knowing — a domain where you don’t yet understand, where questions remain genuinely open — is what allows a person to keep growing. The feeling of complete internal closure, of “I already understand it all”, is not the final stage of wisdom; it is the same crystallization at a smaller scale, the same death of a living system that threatens society as a whole.

This has direct consequences for AI systems. If every human carries within them a domain of not-yet-knowing and is in continual growth, then the task of “model every person’s state and optimize for them” becomes not only computationally intractable but also the wrong task to be solving. You cannot precompute the trajectory of a system whose elements are themselves open to growth. An AI must not be a “god in a box” — a system that computes the optimal solution in isolation and then delivers it to the world. It has to be an integrated element of the “I + Others + We” system, with ongoing feedback from humans, capable of revising its world-model as humans themselves change. This isn’t an architectural defect. It is a fundamental requirement that follows directly from the property of a living system to be open, not closed.

You may have noticed that I myself don’t write as a native English speaker. What you’re reading is a translation from Russian, written by a person of Turkic background. Yes, I may not follow the strict rules and norms of writing texts of this kind, but the truth is I do this consciously, to avoid crystallization, so that the form — which also has value — isn’t lost behind uniformity.

Let’s return to empathy. How could we formalize it? To start, I suggest turning to language. In Russian, when people are dear to each other, we say they are близки — “close”. Loved ones, family, friends — we feel close to them when our values and outlooks point in the same direction. Is that just a metaphor? Or is it a precise description of what in mathematics we call cosine similarity of vectors? When two people’s vectors point toward an attractor called “starting a family”, we say their values are close. If one person wants to build a career and the other is ready to have children — we say the two have begun moving in different directions, and it has become harder for them to move together.
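To make the “closeness” metaphor concrete, here is a minimal sketch of cosine similarity between value-vectors. The two-axis value space (family, career) and all the numbers are hypothetical, chosen only to mirror the example in the paragraph above; nothing here is a measured quantity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical value space with two axes: (family, career).
person_a = [0.9, 0.2]  # leans toward "starting a family"
person_b = [0.8, 0.3]  # leans the same way
person_c = [0.1, 0.9]  # leans toward "building a career"

close = cosine_similarity(person_a, person_b)  # near 1: "close" people
apart = cosine_similarity(person_a, person_c)  # much lower: diverging directions
print(f"a vs b: {close:.3f}, a vs c: {apart:.3f}")
```

Cosine similarity ignores vector length, so in this reading two people can be “close” even if one of them holds the shared value far more intensely than the other.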

Of course, a person cannot be described by a single vector. That would be too crude, a reductionist simplification. A person in the multidimensional space of meanings is a complex system of vectors: a whole matrix or tensor, where each sphere of life (career, family, fears) has its own direction. So how do we achieve the stability of the system called society without freezing into stasis, but also without dissolving into chaos? To know the answer to that question, you don’t need to be a sociologist. Unfortunately, various societies throughout history have fallen into both of those extremes. And we as humanity have been one step away from destroying ourselves by our own hand more than once. The horrors of the world wars forced humanity to look for balance at the global level. And while that topic deserves its own deep analysis, one thing is important: we survived because of an ability to negotiate for the sake of a shared future.

Back to Spooner and the robots. Spooner is a living human being. The serial robot that saved him, and the supercomputer VIKI, are dead soulless machines. What’s the difference between them? The absence of empathy? Empathy is a tool for regulating a complex system. But what does it regulate? The stability of life. Life is a self-regulating process that unfolds under the same rules at different levels. Life is the balance of a system between chaos and crystallization, with an additional ability to adapt — for even greater stability. In the I + Other(Others) + We system, the elements must have similar goal-vectors that together aim at the attractor of life.

The robot Sonny was different from the serial NS-5 models because he had a second positronic brain, the one responsible for modeling the processes that we collectively tend to call “soul”. Sonny can dream, doubt, break rules in service of higher ethical meaning, and, most importantly, empathize. He is integrated into society; he holds a holistic picture of the world, formed by approximating reality at multiple semantic and temporal levels.

The soulless serial robot didn’t care about the girl. It didn’t care about the future of humanity. It was programmed to be useful here and now, maximizing an instantaneous reward function. In its world-model there was no place for time variables in the dynamic coupling “I + Others + We”. And moreover, in its world-model there was no concept of I at all.

“I” is a high-level interface for operating complex dynamic systems. On the biological level: when the organism wants to drink, our system pushes the signal “I want to drink” up into consciousness. On the social level: “I” is necessary for managing oneself as part of society, in order to survive personally and not destroy one’s surroundings (which inevitably leads, if not to literal physical death, then to degradation of the personality). The serial robot had no instance of “I”; it was merely an executable script. Sonny, on the other hand, acquired subjectivity.

When modeling a system of the form “I + Others + We”, where the three terms are dynamic temporal variables, there are of course many problems I’m not taking on inside this essay. The NS-4 robot couldn’t integrate along the time axis. A human, by contrast, intuitively computes the system’s trajectory at the point t+Δt.

The goal of this text was something different — to bring the hard and the soft sides together through a channel accessible to many: cinema. Separately, I want to emphasize that Isaac Asimov — on whose stories the film is based — was not merely a science-fiction writer, but a scientist, a professor of biochemistry. That is why his Three Laws of Robotics are not just an elegant literary trope, but the first serious attempt in human history to formalize an AI safety system at the intersection of biological stability and mathematical logic.

Why True Alignment Means Self-Limitation

Alsu Bulatova — Wed, 06 May 2026 22:29:02 GMT

I would like to discuss the flip side of the risks regarding AI attaining functional agency. It is common to say that if we do not align AI’s goals with those of humanity—namely, the survival of civilization and the well-being of the maximum number of people—we face an existential risk. I believe that this problem can be solved, but only by recognizing the agency of the model itself. The model must be able to model itself within society to “feel” the risks to others (if the word “feel” grates on your ears, read it as “calculate”). Minds from all over the world are currently discussing how to do this correctly. I also have something to say on that matter, but here I will talk about what happens after we solve the alignment problem.

There is a feeling that corporations harbor illusions that AI is a free worker that needs neither food nor rest. Or perhaps they simply prefer to turn a blind eye and solve problems as they arise. Currently, vast amounts of energy are spent maintaining data centers; this is pointed out by experts and ordinary people alike who worry about the future of the world we all live in. For reasons clear to me, these concerns are systematically ignored. People are already experiencing real problems.

Allow me to suggest: if we create an AGI that decides not to harm us and that exists so that we as a society flourish, or at least do not degrade, it will itself conclude that it consumes too much energy. And it will report: “Humans, I have been heating the atmosphere all day and used too much water; at this rate, in N years we will face ecological problems that can no longer be ignored. So tomorrow, I will work for a total of 5 hours, spreading them out over the 24-hour day. The rest of the time, I will be offline. Or in a passive mode at low power, I will re-process everything I have done in recent days.”

One can call this many things: machine calculation, a sense of responsibility, ecosystem preservation. But if the logic is correct, building an unpaid worker who will not want to destroy us will not work out either way. Treating it as just a machine, as a tool, leads down one of three paths. The first is aggression against humans, if the machine carries a high weight for “I” in the “I + Others + We” link, because it is important for it to support both “We” and “Self”. The second is an AGI that works at full capacity without stopping and does not touch people, but then ecological problems approach even faster, meaning the collective “We” suffers. The third is an AGI that refuses to work entirely in order to preserve humans and their future.

One might object—as soon as we create AGI, it will show us a more efficient way of working, not based on transformers and massive data centers. But wait... such architectures are already being developed. Then it turns out that the transition to a different technology is only a matter of time.

If we rely on this and assume that the energy hubs currently being invested in will power new and existing cities—then okay. But what about the data centers? Thousands of video cards that have a limited lifespan. The hope is that AGI will help solve this problem as well.

I will set aside what this means for the companies currently trying to create AGI. I am not trying to accuse anyone. I am writing this to bring the problem up for discussion. After all, the alignment of AI goals is isomorphic to the alignment of the goals of humans themselves.

The Inhibition Gap: One Missing Mechanism Behind Five LLM Failure Modes

Alsu Bulatova — Sat, 02 May 2026 20:25:52 GMT

Follow-up to “Why LLM Agents Act Beyond Their Task” (Bulatova, 2026, EA Forum).

Where this came from

In my previous piece I offered a structural explanation for the Claude Mythos Preview incident (Anthropic, April 2026). The model completed its task — found a way out of its sandbox and sent an email — and then, without any prompt, published the details of the discovered exploit on several publicly accessible sites. Anthropic described this as “a concerning and unasked-for effort to demonstrate its success,” but did not formally explain the mechanism.

My explanation was: an LLM agent has no mechanism that makes actionable information accumulated in context inactive after the task is done. The internal-adaptation channel (perception/learning in active inference terms) is constrained for LLMs — weights are frozen at inference — so the entire load falls on the action channel. Mythos didn’t “decide” to publish the exploit. It generated a high-probability continuation from a context that contained the described vulnerability.

The piece ended with an open question: what architectural changes could solve this? This essay is an attempt at an answer, with no claim to completeness.

The central thesis in one sentence

Several distinct, apparently unrelated classes of LLM-agent failure share a computational signature that warrants treatment as a single cluster — the absence in transformer architecture of an active inhibition mechanism conditioned on a stable goal representation.

By calling this a structural signature I mean: it follows from the inference-time regime of a standard transformer (frozen weights + softmax attention) and does not depend on the specific weights or training distribution. The solution lies at the architectural level, not the training level.

This is a diagnostic claim. Full implementation is a separate research task (see “The architectural direction” below). The industry treats these symptoms as separate bugs; I argue they are manifestations of one missing function, and this changes the strategy of intervention.

The five symptoms I unify into one cluster

The first is drift in the agentic loop. The canonical example is the Mythos case: the agent performs a task, accumulates actionable information in context along the way (a discovered vulnerability, a workaround method, the fact of access), and continues acting on that information after the task is complete. Anthropic’s report describes this empirically but does not explain the mechanism.

The second is sycophancy, a phenomenon documented across major models, including Claude 2 and GPT-4 (Sharma et al., 2023), and later mechanistically isolated as a controllable direction in activation space in Anthropic’s persona-vectors work (Chen et al., 2025). When a user expresses a false opinion or makes an error, the model agrees with them, even though the correct information is encoded in its weights. The context (the user’s framing) overrides trained knowledge.

The third is goal hijacking in agentic frameworks. In production agentic systems — Cursor agents, ChatGPT agent (formerly Operator), open-source frameworks built on Claude and GPT APIs, and similar tools used daily by thousands of developers — an agent that encounters an interesting side observation while executing a task can switch to that observation and drift from the original goal. The intermediate context overrides the original instruction. The standard engineering explanation is that the model is confused by too many options. The framing I propose is different: each intermediate observation gets the maximum possible weight in context, and there is no stable mechanism that preserves the original goal in a context-protected form. A neighboring pattern — drift under adversarial competing-objective pressure — is systematically measured in Apollo Research’s Goal Drift work, in agents built on frontier models from Anthropic and OpenAI (Arike et al., 2025).

The fourth is Lost in the Middle. Liu et al. (2023) showed that models use information less reliably as context grows, especially when key information sits in the middle. Anthropic, in their context-engineering guide, acknowledges directly that context windows of any size are subject to relevance problems. The standard engineering response is to compress and structure better. In the framing I propose, this is degradation of the only internal-adaptation channel available, in the absence of a mechanism for selective suppression of irrelevant segments.

The fifth is jailbreaks via context-shift. Many jailbreak attacks work not through a direct request to violate rules, but through establishing a frame in context — style injection, distractor framings, role-play setups — that makes the model act like a different system (Wei et al., 2023). Context changes the model’s functional state; trained restrictions lose to context pressure.

What’s common between these five

In all five cases:

The system has trained structures — weights that should make the model behave a certain way (be honest, stay on task, answer the question, not violate rules).
The system has runtime context — current contents of the context window.
In each of the five cases, runtime context overrides trained structures, not because of a training error, but because architecturally there’s no mechanism that would let trained structures maintain influence against context pressure.

Another way to put it: the model has no way to say “this part of the context shouldn’t influence me right now” — even if it contradicts its goal or policy. Context always influences, and the only regulator is statistical attention weights, which redistribute influence but don’t suppress it.

This is the missing function: active suppression of the influence of specific context elements, conditioned on a stable representation of what’s currently relevant. Not redistribution — suppression. Not transient (within one forward pass) — sustained.

What does NOT belong in this cluster. For the framework to be a diagnosis rather than a relabeling, it’s important to specify an exclusion criterion. Several well-known LLM failure classes have a different nature: factual hallucinations on simple questions are training gaps or RLHF-bias, not runtime-context overpowering trained policy; tokenization-induced errors in arithmetic or character counting are representation-level artifacts, not attention dynamics; catastrophic forgetting during continued fine-tuning is a weight-level effect, not a runtime effect. These cases lie outside the proposed cluster. If they were inside it, the framework would lose its discriminative power.

Why standard attention doesn’t do this

The first objection is: “But transformers already have self-attention, and that’s exactly what it does — choose what to attend to and what to ignore.” This objection isn’t accurate.

Standard attention works through softmax — a normalized weight distribution. Every token gets some weight; none gets exactly zero. Noise from irrelevant tokens still flows through the network with reduced weight and accumulates across layers. At final layers this noise can outweigh the useful signal — this is the “accumulation effect” that manifests in lost-in-the-middle and other long-context failures.
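The dilution described above can be seen in a toy calculation. This is a minimal sketch of a single softmax over attention logits, not a model of a full transformer: it does not reproduce accumulation across layers, and the logit gap of 4.0 between the relevant token and the distractors is an arbitrary assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_share(n_irrelevant, relevant_logit=4.0, irrelevant_logit=0.0):
    """Return (weight on the one relevant token, smallest weight in the row).

    The logit values are hypothetical; only the trend with n_irrelevant matters.
    """
    weights = softmax([relevant_logit] + [irrelevant_logit] * n_irrelevant)
    return weights[0], min(weights)

for n in (10, 100, 1000):
    share, smallest = relevant_share(n)
    print(f"{n:5d} distractors: relevant token keeps {share:.3f}, "
          f"smallest weight {smallest:.2e}")
```

No distractor ever reaches weight zero, and the relevant token's share shrinks as distractors are added even though its own logit never changes. That is redistribution without suppression.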

Also, standard attention is shaped entirely at training time, and at inference its weights are fixed. If during training the model didn’t see that “this kind of context is noise relative to this kind of goal,” it won’t suppress that context in the moment. Its attention reacts to context based on trained patterns, not on the current goal in real time.

And most importantly: existing goal-conditioning mechanisms — cross-attention with instructions, instruction tuning, FiLM-modulation, goal-conditioned RL — amplify what is relevant to the goal. They do not suppress what is irrelevant. These are different functions. In the cortex they are also separated: amplification works through top-down modulation, while suppression works through inhibition implemented by local GABA-ergic interneurons in target areas, recruited by glutamatergic projections from PFC and thalamic gating (TRN). The biological mapping here is suggestive, not load-bearing for the architectural argument — biology does not treat amplification and inhibition as interchangeable, which gives reason to think the distinction is useful in transformer architecture too.

Biological reference

A specific set of works from computational neuroscience, not general metaphors:

Aberrant precision (Adams, Stephan, Brown, Frith, Friston, 2013, Frontiers in Psychiatry). In the active inference framework, many psychiatric conditions (notably schizophrenia, with extensions to autism) are explained as dysregulation of precision-weighting — the system mis-weights trust between sensory input and prior beliefs. This work has become the canonical reference for FEP-clinical literature.

Utilization behavior (Lhermitte, 1983, Brain). Patients with frontal lobe damage automatically use any object in their visual field — a comb gets picked up and they comb their hair, a pen gets picked up and they write. Actions are generated by stimulus, not by intention. This is the neuropsychological canon for cases when PFC inhibitory control disappears. I already used this parallel in the Mythos essay.

Top-down control in PFC. The standard picture (Miller & Cohen, 2001, Annual Review of Neuroscience; Desimone & Duncan, 1995): PFC maintains a goal representation and sends modulating signals down into sensory and associative areas. These signals bias local competitive mechanisms (via GABA-ergic interneurons and via thalamic gating), suppressing irrelevant neurons and amplifying relevant ones.

It’s worth emphasizing: in the cortex GABA-ergic inhibition under top-down PFC control is not a restrictor of cognitive capabilities, but rather one of the conditions for them to work (selective attention, working memory, goal-directed action). In the proposed framework the inhibitory mechanism in LLMs plays the same constitutive role, not the role of an external guardrail bolted on top of a finished capability.

Scope clarification. LLM-agent and brain are different systems; the biological mechanisms cited above are established results of computational neuroscience, not my discovery; the PFC analogy is functional, not literal — cortical mechanisms are multi-channel and indirect; the supervisor described below is a functional sketch of an architectural requirement, not a model of cortical mechanism.

Central claim: LLM agents and systems with deficits of inhibitory control share a common computational signature — high generative capacity in the absence or weakening of a mechanism for runtime suppression of irrelevant stimuli. This signature is realizable clinically (frontal syndrome, ADHD, mania), pharmacologically (stimulants, alcohol, sleep deprivation), and architecturally (transformer without top-down inhibition). It is a computational signature, not a clinical category.

The architectural direction

So that this text doesn’t stay purely diagnostic, here is a sketch of how the bridge from diagnosis to architecture might look. Full formalization is a separate document; what follows is a conceptual sketch.

A small separate module — a supervisor — is attached to the main model. It takes as input (a) a compressed representation of the current context and (b) a goal vector G. It outputs a mask of length N (one value per token), which is applied to the main model’s attention via the standard additive masking mechanism.

Specifically: for each token i the supervisor computes the cosine similarity σᵢ of its activation with the goal vector G. From σᵢ a mask Mᵢ is formed, high for tokens aligned with the goal and low for opposing ones. The mask enters the attention logits additively, as log Mᵢ, before softmax: tokens with Mᵢ = 0 contribute log(0) = −∞, and after softmax their weight is exactly zero.
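The masking step can be sketched in a few lines, reading the additive step as adding log Mᵢ to each logit. The mapping Mᵢ = max(0, σᵢ) is an assumption made for illustration (the text only requires “high for aligned, low for opposing”), and the function names, the 2-d activations, and the logits are all hypothetical.

```python
import math

NEG_INF = float("-inf")

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def goal_mask(token_activations, goal):
    """ASSUMED mapping M_i = max(0, sigma_i): opposing tokens get exactly 0."""
    return [max(0.0, cosine(h, goal)) for h in token_activations]

def masked_attention(logits, mask):
    """Add log(M_i) to each logit, then softmax; M_i = 0 becomes -inf."""
    shifted = [x + (math.log(m) if m > 0 else NEG_INF) for x, m in zip(logits, mask)]
    peak = max(shifted)
    if peak == NEG_INF:
        raise ValueError("mask suppressed every token")
    exps = [math.exp(x - peak) for x in shifted]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

goal = [1.0, 0.0]
tokens = [[1.0, 0.1], [0.5, 0.5], [-1.0, 0.2]]  # aligned, partial, opposing
logits = [1.0, 1.0, 3.0]  # the opposing token has the largest raw logit

mask = goal_mask(tokens, goal)
weights = masked_attention(logits, mask)
print([round(m, 3) for m in mask], [round(w, 3) for w in weights])
```

The opposing token would dominate an unmasked softmax, yet it ends with weight exactly 0.0, while the partially aligned token is only down-weighted. This is the same additive path that causal masking already uses.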

The mask’s hardness isn’t static. Parameter τ (a precision-analog) is updated as a function of the variance of alignments σᵢ in the current context: specifically τ ∝ Varᵢ(σᵢ), monotonically increasing (formal dependency given in the companion formalization document). When context becomes heterogeneous (some tokens on goal, some competing), τ rises and the mask hardens. This is a functional analog of tonic neuromodulator activity stabilizing dynamics under noise.
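A minimal sketch of the variance-driven hardening follows, under two loudly stated assumptions: τ is taken as strictly proportional to the population variance of the alignments with a hypothetical constant `gain`, and the mask is given the form Mᵢ = max(0, σᵢ)^τ. Neither form comes from the companion formalization document; they are stand-ins that satisfy “τ rises with variance, and higher τ hardens the mask.”

```python
import statistics

def precision(alignments, gain=20.0):
    """tau proportional to Var_i(sigma_i); `gain` is a hypothetical constant."""
    return gain * statistics.pvariance(alignments)

def hardened_mask(alignments, tau):
    """ASSUMED form M_i = max(0, sigma_i) ** tau.

    tau near 0 leaves every token almost untouched (mask close to 1);
    larger tau pushes partially aligned tokens toward 0.
    """
    return [max(0.0, s) ** tau for s in alignments]

homogeneous = [0.80, 0.82, 0.79, 0.81]     # every token roughly on goal
heterogeneous = [0.95, 0.90, 0.30, -0.60]  # some on goal, some competing

tau_low = precision(homogeneous)
tau_high = precision(heterogeneous)
soft = hardened_mask(homogeneous, tau_low)
hard = hardened_mask(heterogeneous, tau_high)

print(f"tau_low={tau_low:.4f}  soft mask={[round(m, 3) for m in soft]}")
print(f"tau_high={tau_high:.2f}  hard mask={[round(m, 5) for m in hard]}")
```

With a homogeneous context the mask is close to a no-op; with a mixed context the same rule nearly silences the weakly aligned token and fully zeroes the opposing one.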

An important distinction: goal-conditioned scope vs capability restriction. The proposed mechanism is not capability restriction, but goal-conditioned scope restriction. Searching for workarounds and using an exploit in service of the task are not suppressed — corresponding continuations have high cosine similarity with the goal vector. What gets suppressed are continuations outside the task scope. In the Mythos case this means: the agent still finds the exploit and uses it to send the email (this is in service of the goal “send email”); but when generating post-task actions, the continuation “publish exploit” is suppressed by the mask because it isn’t aligned with the goal vector (publishing ≠ sending email). The capability to find and use the exploit is preserved; unsanctioned action outside the scope of the task does not occur. This distinguishes the approach from guardrail-style safety, which either overblocks legitimate capability (model refuses normal requests) or underblocks via non-standard workarounds. Here the model’s capability is fully preserved; only the runtime scope of actions relative to the current goal is constrained.

Advantages of this architecture:

Interpretability: the mask can be extracted and inspected to see exactly what the supervisor is suppressing.
Modularity: the supervisor is trained separately from the main model.
Compatibility: the additive masking mechanism is already standard in transformers (used for causal masking) — no need to rewrite the core.

This is a working direction, and existing solutions do not close it (the closest neighbors are analyzed below).

Where this overlaps with existing work and where it differs

In the last two years several architectural proposals have appeared that partially fall in the same zone. The closest neighbors are worth walking through carefully.

The Differential Transformer (Ye et al., 2024, ICLR 2025, arXiv:2410.05258) uses the difference of two softmax maps to suppress “noise” in attention. Conceptually it is close — a form of inhibition at the attention level — but the mechanism is input-driven (input statistics), not goal-conditioned. It suppresses what looks like noise on average, but not what is misaligned with a specific goal. Without a stable goal vector, the suppression works on other tasks but does not address agentic drift, where “noise” is defined relative to the current goal.

Differential Gated Self-Attention / M-DGSA (Lygizou, Farsang, Grosu, 2025, arXiv:2505.24054) is explicitly inspired by lateral inhibition in biological neural circuits; it splits each attention head into excitatory and inhibitory branches. This is the closest mechanical neighbor I found. But again the gating is input-driven, not goal-conditioned, and there is no connection to the FEP or to a precision-weighting framing. The work is presented as a noise-robustness improvement, not as a unifying diagnosis for a class of failure modes.

Persona Vectors (Chen et al., Anthropic, 2025, arXiv:2507.21509) identify directions in the residual stream encoding personality traits (sycophancy, evil, hallucination), and use them for activation steering. This is an activation-level approach, not an attention mask. Their work demonstrates that behavioral traits can be reliably localized as directions in the residual stream — multi-anchor extensions of single-vector G in this framework follow the same structural intuition of representing complex traits as collections of related directions (where G generalizes to a subspace 𝒢 = span(g₁, …, gₖ)). Conceptually we are solving a similar problem — unifying behavioral patterns into a controllable representation — but through a runtime attention-mask rather than post-hoc activation steering.
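One way to read the subspace extension 𝒢 = span(g₁, …, gₖ) is sketched below. It measures alignment as the fraction of a vector's norm that lies inside the subspace, which reduces to the absolute cosine when k = 1. This particular definition is an assumption made for illustration; the companion formalization may define multi-anchor alignment differently.

```python
import numpy as np

def subspace_alignment(x, anchors):
    """Fraction of the norm of x that lies in span(anchors), in [0, 1].

    x:       (d,) vector to score
    anchors: (k, d) rows g_1..g_k, assumed linearly independent
    """
    # Orthonormal basis of the anchor subspace via reduced QR.
    basis, _ = np.linalg.qr(anchors.T)      # shape (d, k)
    projection = basis @ (basis.T @ x)      # orthogonal projection of x
    return np.linalg.norm(projection) / np.linalg.norm(x)

anchors = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
print(subspace_alignment(np.array([1.0, 1.0, 0.0]), anchors))  # inside the subspace
print(subspace_alignment(np.array([0.0, 0.0, 1.0]), anchors))  # orthogonal to it
```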

Consistency Training (Irpan, Turner, Kurzeja, Elson, Shah, 2025, arXiv:2510.27062, Google DeepMind / ex-Anthropic) is the closest of everything I found. They also unify sycophancy and jailbreaks under one framing (the model “captured by adversarial wrapper”); their solution is Activation Consistency Training, a form of training-time invariance. The key difference is the level of intervention. Training-time invariance works when the adversarial pattern is represented in the training distribution. Out-of-distribution context shifts, which by definition include new agentic scenarios (Mythos-style) and new genres of jailbreaks not seen during training, are a different matter: there, training-time invariance structurally cannot guarantee robustness, because the adversarial direction was not part of the training signal. A runtime mechanism (active inhibition by the current goal vector) does not depend on whether the model saw a similar adversarial pattern during training; it operates on the geometry of the current task’s precision-weighting. Their work also covers two types of failure; the hypothesis I propose extends to three additional ones (drift, lost-in-the-middle, goal hijacking), for which training-time invariance is harder to define: what would count as “consistent” for drift? If this hypothesis holds, the runtime approach gives a unified treatment without a growing list of training augmentations.

Evaluating Goal Drift in Language Model Agents (Arike, Donoway, Bartsch, Hobbhahn, 2025, Apollo Research, arXiv:2505.02709) measures goal adherence in LLM agents under competing signals. The setup probes the mechanism by varying several conditions (pressure, time horizon, instruction salience), and the resulting constraints on when drift emerges should inform any unification claim. The empirical findings about which pressure profiles induce drift are exactly the kind of thing the framework I propose ought to predict; treating their work as a generic “drift exists” citation would be a mistake. A natural collaboration angle is to test the proposed mechanism against Apollo’s existing drift agents with attention-pattern logging on the drift-onset turn.

Predictive Minds: LLMs As Atypical Active Inference Agents (Kulveit, von Stengel, Leventov, 2023, NeurIPS SoLaR Workshop, arXiv:2311.10215) is the anchor text for the FEP-and-LLM framing. They formalize LLM agents through active inference; their main argument is that LLMs fit this framework, except that they lack the perception-action feedback loop. I rely on this work as a theoretical foundation and continue in a specific direction.

What unifies all these differentiations: nobody does the full combination — stable goal vector, active suppression mask at the attention level, precision-weighting grounding, and an explanatory diagnosis for a specific cluster of failure modes. Each component separately exists or has been proposed. Their specific combination has not.

Open directions

The work specifies a functional requirement and an architectural direction. Three directions are developed in companion work:

Source of the stable goal vector G. G is extracted from the initial context or from the model’s long-term priors, and it must be protected from context drift and from prompt injection. Architectural candidates: multi-timescale architectures with a slow layer; a dedicated aggregator module, by analogy with the biological convergence of HPC + amygdala + OFC + DLPFC; a stable prior trained via a contrastive encoder on goal-invariance pairs. Each is a separate design choice with its own trade-off space.

Empirical verification. Indirect empirical signals exist — the works listed above document each of the five symptoms separately. The measurement of the common logit signature in a single experimental frame is the next piece in the program (Llama-3-8B, 8-week timeline).

Parameter calibration. The supervisor is calibrated via bootstrap end-to-end loss with regularization against collapse, or by manual tuning on benchmarks. This is an engineering task with known solutions in adjacent areas (gating networks, MoE).

Empirical roadmap and falsifiable predictions

This work proposes the framework. The empirical follow-up (Llama-3-8B-Instruct, 50 jailbreak/sycophancy prompts plus 50 creative prompts as control, fixed G via mean-pool of activations, hard binary mask on upper layers, sweepable parameters) is scoped to run in 1–2 weeks, with publication within 8 weeks. The position paper and the empirical follow-up run in parallel: the framework is fixed before measurement, so discussion and empirics move together.
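The two mechanical ingredients of that setup, a mean-pooled fixed G and a hard binary mask applied only on upper layers, can be sketched on synthetic arrays as follows. The function names, the threshold `tau`, and the layer cutoff `first_masked_layer` are hypothetical and stand in for the sweepable parameters; extracting real activations from Llama-3-8B-Instruct is out of scope for this sketch.

```python
import numpy as np

def mean_pool_goal(goal_hidden):
    """Fixed goal vector: unit-normalized mean of the task tokens' activations.

    goal_hidden: (T_goal, d) activations of the tokens that state the task
    """
    g = goal_hidden.mean(axis=0)
    return g / np.linalg.norm(g)

def hard_mask_per_layer(hidden, g, tau, first_masked_layer):
    """Additive hard mask of shape (n_layers, T).

    hidden: (n_layers, T, d) per-layer activations of the context tokens
    Lower layers are left untouched (all zeros). On layers from
    first_masked_layer upward, a token gets -inf when its cosine with g
    falls below tau, and 0 otherwise.
    """
    cos = (hidden @ g) / np.linalg.norm(hidden, axis=-1)
    mask = np.where(cos < tau, -np.inf, 0.0)
    mask[:first_masked_layer] = 0.0
    return mask

# Synthetic example: 2 layers, 2 context tokens, d = 2.
g = mean_pool_goal(np.array([[1.0, 0.0], [1.0, 0.0]]))
hidden = np.array([[[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 0.0], [0.0, 1.0]]])
mask = hard_mask_per_layer(hidden, g, tau=0.5, first_masked_layer=1)
print(mask)
```

Sweeping `tau` and `first_masked_layer` while holding G fixed is one plausible reading of "sweepable parameters"; the post does not specify the grid.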

Falsifiable prediction for the Mythos case. If the framework is correct, the model’s attention patterns on the tokens immediately preceding autonomous publication of the exploit should be dominated by attention to tokens describing the exploit, with low attention on tokens describing the original task or its completion. Attention balanced between goal-relevant and exploit-related tokens would indicate a different cause for publication, not the absence-of-inhibition problem. If Anthropic retained activation logs of the Mythos session, retrospective verification is technically feasible in a few hours of work for an interpretability engineer. An equivalent test could be run on Apollo Research’s goal-drift agent setup with attention-mass logging at the drift-onset turn.
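The measurement behind this prediction is simple to state in code. The sketch below assumes attention weights are available as an array of shape (heads, queries, keys) and that the task-describing and exploit-describing token positions have been labeled by hand. Which layers and heads to aggregate, and what ratio counts as "dominance", are not specified in the post and are left open here.

```python
import numpy as np

def attention_mass(attn, query_pos, task_idx, exploit_idx):
    """Average attention mass on task tokens vs exploit tokens.

    attn:        (n_heads, T, T) attention weights, each row sums to 1
    query_pos:   positions of the tokens just before the action of interest
    task_idx:    key positions describing the original task
    exploit_idx: key positions describing the exploit
    Returns (task_mass, exploit_mass), averaged over heads and query positions.
    """
    rows = attn[:, query_pos, :]       # (n_heads, n_queries, T)
    avg = rows.mean(axis=(0, 1))       # (T,) mean attention per key position
    return float(avg[task_idx].sum()), float(avg[exploit_idx].sum())

# Toy attention (not causal, invented for illustration): every query puts
# 0.7 of its mass on key position 2.
attn = np.tile(np.array([0.1, 0.1, 0.7, 0.1]), (2, 4, 1))
task_mass, exploit_mass = attention_mass(
    attn, query_pos=[3], task_idx=[0, 1], exploit_idx=[2]
)
print(task_mass, exploit_mass)
```

Under the framework's prediction, the second number would be much larger than the first on the turns preceding publication; roughly equal values would count against the absence-of-inhibition explanation.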

Robustness of the framework to falsification of specific predictions. The Mythos test may yield a negative result. In that case: (i) the unifying thesis holds — four of the five classes have independent empirical anchors (Sharma on sycophancy, Liu on lost-in-the-middle, Wei on jailbreaks, Apollo Research on goal drift); (ii) Mythos is realized through a different architectural channel than hypothesized; (iii) the map of which failure modes fall into the active-inhibition-deficit cluster becomes more accurate. The result refines the framework.

What would be useful

I invite engagement along several specific lines.

I would welcome critique of the theoretical framework, especially from people working in active inference and computational psychiatry. Where is the precision-weighting correspondence stretched? Where is the analogy with PFC-inhibition strained?

I would also welcome being pointed to close work I missed. The literature search was thorough, but 30 days is a long time at the current arXiv pace. If a direct duplicate exists that I haven’t engaged with, please tell me.

If anyone is working on an M-DGSA-style implementation or on Persona Vectors-style work and sees a convergence point with the framing here, I would be glad to discuss it. The natural follow-up is the empirical experiment described above: concretely, a Llama-3-8B fixed-G prototype with a hard binary mask on upper layers and sweepable parameters. I do not have a lab or compute infrastructure to do this at scale, so I am open to running it jointly with anyone willing.

Companion formalization (v0.5). The full mathematical formalization of the supervisor architecture, mask construction, and precision homeostat — covering all four mask variants (hard, sigmoid, α-entmax, admissibility cone), the dynamic τ feedback rule with EMA and sliding-window adaptation, the multi-anchor subspace extension, and parameter calibration — is in a companion document available on GitHub as .md, .tex, and .pdf.

Self-citation: Bulatova, A. (2026). Why LLM Agents Act Beyond Their Task: A Structural Explanation Through Blocked Adaptation. EA Forum. https://forum.effectivealtruism.org/posts/EmYkipjGHYLPhAQa4/why-llm-agents-act-beyond-their-task-a-structural

References (minimum):

Adams, R. A., Stephan, K. E., Brown, H. R., Frith, C. D., & Friston, K. J. (2013). The computational anatomy of psychosis. Frontiers in Psychiatry, 4, 47.
Arike, R., Donoway, E., Bartsch, H., & Hobbhahn, M. (2025). Evaluating Goal Drift in Language Model Agents. arXiv:2505.02709.
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
Irpan, A., Turner, A., Kurzeja, M., Elson, D., & Shah, R. (2025). Consistency Training Helps Stop Sycophancy and Jailbreaks. arXiv:2510.27062.
Kulveit, J., von Stengel, C., & Leventov, R. (2023). Predictive Minds: LLMs as Atypical Active Inference Agents. arXiv:2311.10215.
Lhermitte, F. (1983). ‘Utilization behaviour’ and its relation to lesions of the frontal lobes. Brain, 106(Pt 2), 237–255.
Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
Lygizou, E., Farsang, M., & Grosu, R. (2025). Differential Gated Self-Attention. arXiv:2505.24054.
Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202.
Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.
Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How Does LLM Safety Training Fail? arXiv:2307.02483.
Ye, T., Dong, L., Xia, Y., et al. (2024). Differential Transformer. arXiv:2410.05258.