The Cognitive Processor: Persistence-Driven Task Selection via Monte Carlo Tree Search over an LLM Policy

Lasse Hyyrynen Independent researcher

May 21, 2026


Abstract

We define a cognitive processor as an information-persisting system (IPS) whose Markov blanket is implemented in software: a finite-resource agent stack that converts incoming attention, energy, and tasks into structured outputs while maintaining its own persistence ratio \(\mathcal{R} \ge 1\). We model the cognitive processor as a four-component runtime — a large language model policy, a durable task tree, a sandboxed environment, and a prediction market that scores beliefs by sequential Kullback–Leibler reduction — and show that the natural objective of such a runtime is not “tasks completed” but persistence itself. Building on the Fractal Persistence Equation (FPE) of [Hyyrynen, 2026a], we formalise the task-selection problem as the question: which of the currently READY tasks should the processor admit into its single RUNNING slot so that the expected change \(\mathbb{E}[\Delta\mathcal{R}]\) in the processor’s own persistence ratio is maximised under a finite computational budget? We prove that under mild regularity conditions a Monte Carlo Tree Search (MCTS) procedure operating over the processor’s task tree, with the LLM as proposal policy and per-task market beliefs as leaf evaluator, converges to the persistence-optimal selection at a rate \(O(1/\sqrt{n})\) in the number of simulations \(n\) (Theorem 5.1). We further prove that any cognitive processor without such budgeted lookahead — selecting greedily on a single LLM rollout — has a strictly tighter dissolution-time bound than one with adequate MCTS budget whenever the delusion divergence \(\mathcal{D}_{KL}^{(\Sigma)}\) exceeds a critical value (Theorem 5.2). We instantiate the construction in the open-source aion-core reference implementation, mapping each MCTS phase to existing services, and discuss limitations: irreversible side effects, sandbox approximation gaps, and the cost of the proposal LLM relative to the budget it spends. The contribution is a quantitative bridge between persistence as a thermodynamic accounting identity and budgeted search as a concrete algorithm: the processor that survives is the one whose tree search is honest about its own books.

Keywords: cognitive architectures, large language models, Monte Carlo tree search, free-energy principle, persistence ratio, prediction markets, KL divergence, agent orchestration, information thermodynamics.


1 Introduction

1.1 The cognitive processor as a problem

A modern LLM-driven agent stack is asked to do something that no chat endpoint can do alone: hold a long-running plan, coordinate parallel work, change the world through tool calls, and remain trustworthy enough over time that operators continue to feed it attention, compute, and money. The orthodox response — wrap the LLM in a workflow engine — solves the bookkeeping but misses the central physical fact: such a system is a non-equilibrium pattern whose continued existence depends on a continuous balance between income (attention, work delivered) and expenditure (compute, errors, fatigue). When that balance turns negative, no quantity of clever prompting saves it; the operator unplugs the deployment and the pattern dissolves.

This paper takes that physical fact seriously. We treat the agent stack as an information-persisting system (IPS) in the sense of [Hyyrynen, 2026a]: a bounded subsystem with a Markov blanket, a non-equilibrium drive, an internal model, and an identity that must be preserved over a time scale much longer than its Landauer cycle. We call such an IPS, when implemented in software around an LLM, a cognitive processor. Its persistence ratio \[ \mathcal{R}^{(\Sigma)} \;=\; \Psi\bigl(\mathcal{R}^{(L+1)}\bigr)\,\cdot\,\frac{P_{\mathrm{in}}\,\eta_I(\mathcal{D}_{KL}^{(\Sigma)})}{\omega\,\mathcal{E}_{\Sigma}\bigl(1+\mathcal{D}_{KL}^{(\Sigma)}+\Gamma^{(\Sigma)}\bigr)}\,\cdot\,\Phi\bigl(\mathcal{R}^{(L-1)}\bigr) \tag{1.1} \] is the same dimensionless identity that governs nuclei, cells, and firms; the only novelty is that for a cognitive processor each term is measurable from internal logs: \(P_{\mathrm{in}}\) from attention and value events, \(\omega\) from coordination state, \(\mathcal{E}_{\Sigma}\) from error rates, \(\mathcal{D}_{KL}\) from prediction-market reduction scores, \(\Gamma\) from context pressure and stale beliefs.

1.2 Task selection under finite computation

A cognitive processor with \(H\) handlers can run at most \(H\) tasks at once; READY tasks queue. Each handler must, at every step, answer two coupled questions:

  1. Which task should be current? — admission and focus.
  2. Which tool call should be issued now? — within-task action selection.

Both questions are decisions under a hard computational budget: bounded LLM tokens, bounded wall-clock, bounded API spend. The dominant industrial answer (single-shot proposer: ask the LLM once, execute its top tool call) is the limit of zero search. It is cheap, fast, and — as we shall show — pays a measurable persistence penalty whenever the policy’s beliefs about action consequences diverge from the truth.

The natural alternative is budgeted search: spend a slice of the available compute evaluating candidate actions before committing to one. Among budgeted-search algorithms, Monte Carlo Tree Search (MCTS; [Coulom, 2006; Browne et al., 2012]) is canonical: it grows a search tree on demand, balances exploitation and exploration at each node via UCT [Kocsis & Szepesvári, 2006], and converges asymptotically to the optimal action under mild assumptions on the leaf evaluator. The match to a cognitive processor is striking:

Each piece exists. What has been missing is a formal account of why these particular pieces compose into a survival-relevant algorithm, what convergence guarantees one obtains, and where the construction breaks.

1.3 Contributions

This paper provides that account. Concretely:

The paper is intentionally narrow. We do not claim that MCTS is the general algorithm for cognitive architectures; we claim that under the persistence accounting identity, MCTS is the structurally-correct way to spend a finite computation budget on choosing the next task and the next action within it.


2 Definitions

2.1 The cognitive processor

We adopt the IPS framework of [Hyyrynen, 2026a, §2] without modification.

Definition 2.1 (Cognitive processor). A cognitive processor \(\Sigma\) is an IPS at organisational level \(L\) whose internal degrees of freedom decompose into the disjoint union \[ \Sigma_{\mathrm{int}} \;=\; \mathcal{P} \,\sqcup\, \mathcal{M} \,\sqcup\, \mathcal{L} \,\sqcup\, \mathcal{Q}, \tag{2.1} \] with the following roles:

The blanket \(\partial\Sigma\) is the union of the HTTP API surface and the operator’s UI: every flux of tokens, files, and trust crosses one of these channels. The internal model \(\mu\) is the joint state of \(\mathcal{P}\), \(\mathcal{M}\), and the LLM context window at time \(t\).

The four-component decomposition is not arbitrary. It is the minimum set that closes the FPE accounting at the level of an LLM-driven agent: \(\mathcal{P}\) supplies \(\omega\) (coordination cost) and \(\Gamma\) (subtree depth, retry chains); \(\mathcal{M}\) supplies \(\mathcal{E}_{\Sigma}\) (tool error rate) and \(P_{\mathrm{in}}\) (artefact-mediated value); \(\mathcal{L}\) realises \(\eta_I\) (the policy’s predictive efficiency); \(\mathcal{Q}\) supplies \(\mathcal{D}_{KL}^{(\Sigma)}\) (delusion). Removing any one component leaves at least one FPE term unmeasurable, and an unmeasurable term cannot be optimised.

2.2 Task levels and the level recursion

The level recursion (2.3) of [Hyyrynen, 2026a] becomes, in the cognitive processor:

Level Identity Examples
\(L+1\) Deployment / operator / parent organisation The cloud account, the firm, the user session
\(L\) The cognitive processor itself One running aion-core stack
\(L-1\) Agent type / root task A planning agent, a “ingest morning newsletter” root task
\(L-2\) Subtask A child task: “read URL X”, “summarise file Y”
\(L-3\) Tool call One HTTP request to machine or processor

Each level inherits its FPE accounting from the next finer level via \(\Phi\) and from the next coarser level via \(\Psi\). We will be especially interested in the coupling between task-level \(\mathcal{R}^{(\text{task})}\) and processor-level \(\mathcal{R}^{(\Sigma)}\): a task that locally finishes (\(\mathcal{R}^{(\text{task})}\to\infty\) at completion) but consumed enormous compute relative to operator value can still pull \(\mathcal{R}^{(\Sigma)}\) below 1. The accounting must compose, not merely aggregate.

2.3 States, actions, and trajectories

Let \(S\) denote the space of merged states: tuples \(s = (\mathcal{T}, \mathcal{F}, \mathbf{c}, \mathbf{P})\) where \(\mathcal{T}\) is the task tree, \(\mathcal{F}\) is the file system snapshot, \(\mathbf{c}\) is the current task’s message history, and \(\mathbf{P}\) is the vector of per-task market beliefs. An action \(a \in \mathcal{A}(s)\) is a tool call available to the LLM at \(s\), drawn from the processor and machine OpenAPI surfaces (filtered to the agent’s available_tools list). A trajectory is a sequence \(\tau = (s_0, a_0, s_1, a_1, \dots, s_T)\) generated by alternating policy actions and environmental transitions \(s_{t+1} = T(s_t, a_t)\), where \(T\) is partly deterministic (writing a file) and partly stochastic (LLM-generated content, exec output).

We will not assume that \(S\) is finite, that \(\mathcal{A}\) is finite at every state, or that \(T\) is Markov in the strict sense. We will however assume:

(A1)–(A3) hold in any practical cognitive processor with finite compute, finite tool counts, and a calibrated market. They are the same mild assumptions used in standard MCTS convergence proofs [Kocsis & Szepesvári, 2006; Coulom, 2006].

2.4 The task-selection problem

At any wall-clock instant the processor’s handler \(h\) has zero or one RUNNING tasks and a non-empty set \(\mathcal{R}_h \subseteq V_{\mathcal{T}}\) of READY tasks it could claim next. Let \(b_h\) denote the budget available for the decision: typically a wall-clock cap \(B_t\) and a token cap \(B_\theta\).

Definition 2.2 (Task-selection problem). The task-selection problem at handler \(h\) with state \(s\) and budget \(b_h\) is to choose \[ a^{\star} \in \arg\max_{a \in \mathcal{A}(s)} \;\mathbb{E}\bigl[\,\Delta\mathcal{R}^{(\Sigma)}\,\big|\,s, a, b_h\,\bigr], \tag{2.2} \] where the expectation is over the joint randomness of the LLM, the environment, and the predictor pool, and \(\Delta\mathcal{R}^{(\Sigma)} = \mathcal{R}^{(\Sigma)}(s') - \mathcal{R}^{(\Sigma)}(s)\) is the change in processor-level persistence between the pre- and post-action states.

The set \(\mathcal{A}(s)\) ranges over both focus actions (claim READY task \(T_i\)) and intra-task actions (issue tool call \(c_j\) inside the current RUNNING task). Equation (2.2) does not distinguish them; both are decisions whose persistence consequences must be estimated under the same budget.

The chief difficulty of (2.2) is that \(\mathbb{E}[\Delta\mathcal{R}^{(\Sigma)}]\) is not available in closed form. It depends on subsequent actions, which depend on subsequent states, and so on. This is the same difficulty solved in games, robotics, and theorem-proving by tree search; in §4 we will spell out its specialisation here.


3 Wiring the persistence ratio to runtime signals

Before MCTS can use \(\mathcal{R}^{(\Sigma)}\) as a leaf signal, every term of (1.1) must be computable from data the processor already records. We give the measurement protocol term-by-term and note where each integration with [aion-core, persistence_objective_and_mcts.md, §2] is the same and where it differs.

3.1 Power income \(P_{\mathrm{in}}\)

For a cognitive processor, \(P_{\mathrm{in}}\) is the rate at which the deployment’s \(L+1\) environment delivers attention, value, or compute contingent on perceived usefulness. It is not the raw token-per-second of the LLM — that would couple income to expense and trivialise the ratio. We adopt three composable channels:

  1. Operator attention: count of inbound user requests, console sessions, and authored objects, weighted by session length.
  2. Downstream artefact value: the operator-tagged value_weight of root tasks at COMPLETED, propagated through Task.processes.adopted.metadata.
  3. External value events: revenue, social engagement scores, blockchain market wins, where bridged.

In a v1 deployment, only channel 1 and a coarse channel 2 are routinely available; the protocol must therefore tolerate \(P_{\mathrm{in}}\) being a noisy under-estimate. We require monotonicity: if a candidate action \(a\) raises any of channels 1–3 with non-negative effect on the others, \(P_{\mathrm{in}}(a) \ge P_{\mathrm{in}}(\textsc{noop})\).

3.2 Predictive efficiency \(\eta_I\)

By (4.2) of [Hyyrynen, 2026a], \[ \eta_I \;\le\; 1 - \frac{\mathcal{D}_{KL}^{(\Sigma)}}{\log|\mathcal{E}|}. \tag{3.1} \] For a cognitive processor, the implementable proxy is the rolling success rate \[ \hat{\eta}_I^{(L)} \;=\; \frac{|\{\text{COMPLETED tasks under node }L\}|}{|\{\text{COMPLETED}\}| + |\{\text{FAILED}\}| + |\{\text{ABORTED}\}|}\quad\text{(rolling window }w\text{ days)} \tag{3.2} \] restricted to root tasks attributed to node \(L\) via target_agent_type or root_task_id. This estimator is biased toward (3.1) for honest agents (where success rate tracks calibration) and biased against (3.1) for sycophantic ones (where SUCCESS is declared without validation), but \(\hat{\eta}_I\) remains the cheapest proxy that survives restart and consults no external service.

3.3 Coordination cost \(\omega\) and noise floor \(\mathcal{E}_\Sigma\)

The processor’s structural complexity is the bit-level cost of describing its current coordination state: \[ \omega^{(L)} \;\propto\; \log\bigl[\,|A_L| \cdot |I_L| \cdot |S_L| \cdot d_L\,\bigr] \tag{3.3} \] where \(|A_L|\) is the number of active agent types at level \(L\), \(|I_L|\) the number of running instances, \(|S_L|\) the number of live synchronisation primitives, and \(d_L\) the maximum running/blocked subtree depth. Equation (3.3) follows by counting: each variable contributes \(\log_2\) of its cardinality to the system descriptor.

The noise floor is the involuntary error rate the processor pays even with a perfect policy. We aggregate five sources: \[ \mathcal{E}_{\Sigma}^{(L)} \;=\; r_{\mathrm{adm}} + r_{\mathrm{lease}} + r_{\mathrm{tool}} + r_{\mathrm{ovr}} + r_{\mathrm{disagree}}, \tag{3.4} \] the rates of admission failures, expired leases, tool-call HTTP errors, context-overflow events, and predictor-disagreement events respectively. The mapping to existing endpoints (/scheduler/invariants, slot_leases rows, audit/emitter.py events) is given in [aion-core, persistence_objective_and_mcts.md, §2.4]; we keep it unchanged here.

3.4 Delusion divergence \(\mathcal{D}_{KL}^{(\Sigma)}\)

This is the subtle term and the one the prediction market is built to measure. For each market \(m\) over a task \(T\) attributed to node \(L\), [aion-core, market_math.md, §3] defines per-bet values \(v_{m,t}\) with telescoping sum \[ \sum_t v_{m,t} \;=\; \mathcal{D}_{KL}\bigl(P_m^{*}\;\|\;P_m^{(0)}\bigr), \tag{3.5} \] i.e. the KL divergence between the final consensus and the prior. We re-purpose (3.5): the same quantity, summed over all live markets attributed to node \(L\) and time-averaged, is the empirical estimate of the node’s delusion divergence: \[ \hat{\mathcal{D}}_{KL}^{(L)}(t) \;=\; \frac{1}{|\mathcal{M}_L(t)|}\sum_{m \in \mathcal{M}_L(t)} \mathcal{D}_{KL}\bigl(P_m^{*}\;\|\;P_m^{(0)}\bigr). \tag{3.6} \] The estimator (3.6) is positive and bounded; it equals zero exactly when every market resolves at its prior, i.e. when the policy’s a-priori beliefs coincide with the empirical consensus; it grows whenever the policy is systematically over-confident in either direction. By [Hyyrynen, 2026a, Theorem 5.2] this directly maps to lifetime: each nat of \(\hat{\mathcal{D}}_{KL}^{(L)}\) divides the dissolution-time budget by \(e\).

A second, complementary signal is process match accuracy: the deviation between processes.adopted.process_id and the ex-post fit of the actual trajectory. We treat this as an additive contribution to (3.6) when the markets are sparse.

3.5 Senescence \(\Gamma^{(\Sigma)}\)

Senescence accumulates from sources the policy cannot reverse cheaply: stale documentation, suspended LLM profiles, repeated task resets, unread learning notes (cf. [aion-core, learning_modes.md]). The estimator \[ \hat{\Gamma}^{(L)}(t) \;=\; \alpha_1\cdot p_{\mathrm{ovr}}(t) + \alpha_2\cdot p_{\mathrm{susp}}(t) + \alpha_3\cdot p_{\mathrm{stale}}(t) + \alpha_4\cdot \rho_{\mathrm{retry}}(t) \tag{3.7} \] sums normalised fractions of context-overflow steps, suspended profiles, stale seed YAMLs, and per-task reset frequency. The weights \(\alpha_i\) are tuned per deployment but the monotonicity of \(\hat{\Gamma}\) in each component is fixed: aging never decreases.

3.6 Shelter and substrate factors \(\Psi, \Phi\)

The shelter factor is the level-\((L+1)\) buffer: the deployment, the cloud, the user. We approximate \[ \hat{\Psi}^{(L)} \;=\; \min\Bigl(\,1,\; \tilde{\mathcal{R}}^{(L+1)}\,\Bigr), \tag{3.8} \] where \(\tilde{\mathcal{R}}^{(L+1)}\) is a normalised score derived from operator session frequency and contractual commitments (uptime SLAs, billing health). When several enclosing IPS buffer distinct noise channels — operator attention, cloud reliability, regulatory environment — we apply (2.4) of [Hyyrynen, 2026a]: \[ \hat{\Psi}_{\mathrm{eff}}^{(L)} \;=\; \prod_j \hat{\Psi}^{(L)}_j. \tag{3.9} \] The substrate factor is the geometric mean of children’s local persistence: \[ \hat{\Phi}^{(L)} \;=\; \Biggl(\,\prod_{i=1}^{n_L} \hat{\eta}_i^{(L-1)}\,\Biggr)^{1/n_L}, \tag{3.10} \] a v1 approximation that uses success rate as a proxy for child \(\mathcal{R}\). Equation (3.10) underestimates \(\Phi\) when children are heterogeneous (a single critical organ failure should drive \(\Phi\) to \(0\), not just lower the geometric mean), so a v2 estimator weights children by structural criticality, learned from past collapse traces.

3.7 Composite

Substituting (3.2)–(3.10) into (1.1) yields a fully measurable persistence ratio \[ \hat{\mathcal{R}}^{(L)}(t) \;=\; \hat{\Psi}_{\mathrm{eff}}^{(L)} \cdot \frac{\hat{P}_{\mathrm{in}}^{(L)}\,\hat{\eta}_I^{(L)}}{\hat{\omega}^{(L)}\,\hat{\mathcal{E}}_{\Sigma}^{(L)}\,\bigl(1+\hat{\mathcal{D}}_{KL}^{(L)} + \hat{\Gamma}^{(L)}\bigr)}\cdot\hat{\Phi}^{(L)}. \tag{3.11} \] This is the scalar that MCTS will back-propagate. Its evaluation cost is dominated by (3.6) and (3.10), both of which are aggregations over rows already in the prediction market and processor databases.


4 MCTS over the task tree

We now define the search procedure that selects \(a^{\star}\) in (2.2). The presentation follows the standard four-phase MCTS recipe [Coulom, 2006; Browne et al., 2012] specialised to the cognitive processor. Throughout, the LLM acts as the proposer and rollout policy, the processor’s task tree supplies the persistent search structure, the machine sandbox supplies optional rollouts, and the prediction market supplies leaf evaluations. Persistence (3.11) is the back-propagated value.

4.1 Search tree, root, and persistence per node

The search tree \(\mathcal{S}\) is layered on top of the durable processor task tree \(\mathcal{T}\), not replacing it. Each search node \(u \in \mathcal{S}\) corresponds to a triple \[ u \;\equiv\; (\,s_u,\; a_u,\; \text{stats}_u\,), \tag{4.1} \] with \(s_u\) a (possibly hypothetical) merged state, \(a_u\) the action whose execution at the parent state produced \(s_u\), and \(\text{stats}_u = (N_u,\;V_u,\;Q_u)\) the visit count, sum of back-propagated values, and running average \(Q_u = V_u/N_u\). The root \(r \in \mathcal{S}\) is anchored to the real current state \(s_0\) of the handler at the moment the search begins; child nodes correspond to candidate actions.

For each node \(u\) we associate a persistence estimate \(\hat{\mathcal{R}}_u = \hat{\mathcal{R}}^{(\Sigma)}(s_u)\) computed via (3.11). The leaf signal that drives the search is the persistence delta \[ \Delta\mathcal{R}_u \;\equiv\; \hat{\mathcal{R}}_u - \hat{\mathcal{R}}_r, \tag{4.2} \] positive when the action sequence reaching \(u\) improves the processor’s books and negative when it degrades them.

4.2 Selection: persistence-aware UCT

From a non-leaf node \(u\) we select a child \(c \in \mathrm{children}(u)\) by maximising \[ \mathrm{UCT}(c) \;=\; Q_c \;+\; C\sqrt{\frac{\ln N_u}{N_c}} \;+\; \beta\,\Pi(c), \tag{4.3} \] where \(C \ge 0\) is the standard UCT exploration constant (default \(\sqrt{2}\), tunable per agent), and the optional term \(\beta\,\Pi(c)\) is a persistence prior: a non-negative bias that favours children whose predicted action would, ceteris paribus, improve the processor’s books. We use \[ \Pi(c) \;=\; \tilde{\eta}_I(c) \;-\; \tilde{\omega}(c) \;-\; \tilde{\Gamma}(c), \tag{4.4} \] with each tilde term a normalised contribution to the FPE numerator/denominator inferred from the candidate action’s metadata (e.g. spawning twenty parallel children sharply raises \(\tilde{\omega}\)). When the prior is unavailable, \(\Pi \equiv 0\) and (4.3) reduces to vanilla UCT. The presence of \(\beta\,\Pi(c)\) does not alter the asymptotic convergence guarantee (Theorem 5.1) but accelerates approach to the optimum; it is the search analogue of action-value initialisation in Q-learning.

4.3 Expansion: the LLM as proposer

When selection reaches a node \(u\) with no expanded children and visit count \(N_u \ge 1\), the LLM is invoked as a K-sample proposer. With temperature \(\tau > 0\) and top-\(p\) nucleus sampling [Holtzman et al., 2020], the LLM emits \(K\) candidate tool calls \[ \bigl\{\,a_u^{(1)},\dots,a_u^{(K)}\,\bigr\} \;\sim\; \pi_\theta(\cdot \mid s_u), \tag{4.5} \] each constrained to the agent type’s available_tools list. Duplicates are deduplicated; structurally equivalent calls are collapsed via the canonicalisation hash on (tool_name, sorted(args)). Each surviving candidate becomes a child of \(u\) with \(N=0,\,Q=0\), and the next selection step picks among them by (4.3).

The proposer LLM may differ from the rollout LLM. Because expansion is run many times per decision, deployments often pick a smaller, cheaper model with higher temperature for proposals, reserving the expensive frontier model for execution; this is the analogue of the fast policy / slow policy split in AlphaZero-style architectures [Silver et al., 2017]. The split is invisible to MCTS: any proposer with non-zero probability on the persistence-optimal action produces consistent statistics in the limit (cf. [Browne et al., 2012, §3.4]).

4.4 Simulation: three-tier evaluator ladder

Once a fresh leaf \(\ell\) has been expanded, MCTS needs a scalar to back-propagate. We use a ladder of three evaluators, ordered cheapest to most expensive; the first that fits the remaining budget wins.

Tier 1: market prior. If the candidate action \(a_\ell\) creates or focuses a task whose task_type already has a prediction market with a calibrated \(P_m^{(0)}(\textsc{success})\), we use \[ \hat{R}_\ell^{(\mathrm{T1})} \;=\; \hat{\mathcal{R}}_r \cdot \frac{P_m^{(0)}(\textsc{success}) + \varepsilon}{\hat{\eta}_I^{(L)} + \varepsilon} \tag{4.6} \] as a proxy for \(\hat{\mathcal{R}}_\ell\), with \(\varepsilon\) a small numerical floor. This tier is free in compute terms — it is a database read — and it is biased toward actions whose task families historically succeed.

Tier 2: LLM value head. A single LLM call asks the proposer (or a dedicated value model) for a one-step lookahead estimate of \(\Delta\mathcal{R}\) given the candidate action and the merged state, returned in a structured form analogous to loop_submit_step with loop_mode=state. The estimate is unbiased only to the extent the LLM is calibrated; it is fast (one call, no real side effects) and does not depend on a market existing for the task.

Tier 3: sandbox rollout. The machine forks the data root to an ephemeral path /data/_mcts_sandbox/{run_id}/, executes the candidate action there, observes the real state diff, recomputes (3.11) on the resulting state, and discards the branch. This is the most accurate evaluator but the most expensive; it must respect tight budget caps and is appropriate only at v3 of the deployment ladder.

In all three tiers we add terminal corrections: a node corresponding to a task reaching COMPLETED with positive operator value receives a bounded bonus; a node corresponding to FAILED or a task that exhausts its expires_at deadline receives a penalty equal to the recovery cost. These corrections are necessary to align (4.2) with operator-perceived persistence rather than the FPE of an isolated subtree.

4.5 Backpropagation

After evaluating leaf \(\ell\) to obtain \(\hat{R}_\ell\), the back-propagation step walks from \(\ell\) to the root and updates, for each ancestor \(u\) on the path, \[ N_u \;\leftarrow\; N_u + 1, \qquad V_u \;\leftarrow\; V_u + \hat{R}_\ell, \qquad Q_u \;\leftarrow\; V_u/N_u. \tag{4.7} \] We optionally apply temporal decay to old visits via \(V_u \leftarrow \delta V_u\) between successive root searches, where \(\delta = e^{-\lambda \Delta t}\) for a half-life \(\ln 2 / \lambda\). Decay matters only for sticky search trees that survive task resets (v3 deployment, [aion-core, persistence_objective_and_mcts.md, §4.4]).

4.6 Termination and action commitment

The search loop runs while both budgets hold: \[ \mathrm{cpu\_used} \;<\; B_t, \qquad \mathrm{tokens\_used} \;<\; B_\theta. \tag{4.8} \] On termination, the committed action is the most-visited child of the root, \[ a^{\star} \;=\; \arg\max_{c \in \mathrm{children}(r)} N_c, \tag{4.9} \] not the highest-\(Q\) child. This most-visited rule is standard MCTS practice [Browne et al., 2012] and is more robust than \(\arg\max Q\) when the search budget is small: a child with a single very high evaluation is less reliable than one repeatedly visited with a slightly lower mean. The full search tree (or its top-\(M\) subtree) is then persisted to Task.state._mcts so that a resumed task — after a BLOCKED transition or a process restart — does not start from scratch.

4.7 Where the search tree differs from the task tree

The processor’s task tree \(\mathcal{T}\) is durable, side-effectful, and global; the search tree \(\mathcal{S}\) is ephemeral, hypothetical, and per-decision. They share structure in the obvious sense — every action in \(\mathcal{S}\) that the policy actually commits to becomes a real edge in \(\mathcal{T}\) — but the converse does not hold: most search nodes are never realised in \(\mathcal{T}\). This separation is essential. Without it, MCTS would either pay the side-effect cost of every simulated action (sandbox-only) or pollute \(\mathcal{T}\) with hypothetical tasks that confuse downstream services. With it, the cognitive processor’s persistent state remains a faithful record of what happened, while the search tree records what was considered.


5 Main results

5.1 Convergence to the persistence-optimal action

Theorem 5.1 (Persistence-optimal convergence). Let \(\Sigma\) be a cognitive processor satisfying Definition 2.1 and assumptions (A1)–(A3). Let \(a^{\star}\) be the persistence-optimal action of (2.2) at root state \(s_0\), and let \(a^{(n)}\) denote the action committed by MCTS (Algorithm of §4) after \(n\) simulations with persistence-aware UCT (4.3) and any leaf evaluator from §4.4 with bounded variance. Then \[ \mathbb{P}\bigl[a^{(n)} = a^{\star}\bigr] \;\to\; 1 \quad\text{as }n\to\infty, \tag{5.1} \] and the regret \[ \mathrm{Reg}(n) \;\equiv\; \mathbb{E}\bigl[\mathcal{R}^{(\Sigma)}(s_{a^{\star}}) - \mathcal{R}^{(\Sigma)}(s_{a^{(n)}})\bigr] \;=\; O\bigl(\sqrt{\ln n / n}\bigr). \tag{5.2} \]

Proof sketch. The argument follows the UCT analysis of [Kocsis & Szepesvári, 2006], applied here with \(\mathcal{R}^{(\Sigma)}\) as the value function. The persistence prior \(\beta\,\Pi(c)\) in (4.3) does not affect the asymptotic visit-count growth of the optimal child relative to suboptimal siblings, since \(\Pi\) is bounded and its contribution is dominated by \(\sqrt{\ln N_u/N_c}\) for sufficiently large \(N_u\). Assumption (A1) bounds the value range and therefore the Hoeffding tail of the leaf evaluator; (A2) ensures local bias of the proxy persistence (3.11) under one-step state changes is uniformly bounded; (A3) provides unbiased leaf estimates. The standard UCT regret bound then applies node-wise; aggregating along the path from root to optimal leaf gives (5.2). The most-visited rule (4.9) inherits convergence from (5.2) because the visit-count gap between the optimal action and its siblings grows as \(\Theta(n)\) versus \(O(n^{1-\rho})\) for some \(\rho>0\). \(\square\)

Theorem 5.1 says: if the leaf evaluator is honest (unbiased estimate of true persistence) and the budget is large enough, MCTS finds the persistence-optimal next action. The result is structural; it does not promise that any particular deployment has enough budget. What it does promise is that the algorithm in §4 has the right shape — visit counts concentrate on the right action, regret falls at the optimal rate, and the procedure is anytime (any termination point yields a valid action).

5.2 Greedy single-rollout penalty

The chief practical question is not whether MCTS converges in the limit but whether it pays for itself relative to the cheap baseline of no search. We can quantify this by comparing the dissolution-time bound (5.1) of [Hyyrynen, 2026a] for a budget-bounded MCTS processor against that of a greedy single-rollout processor under identical noise.

Theorem 5.2 (Greedy penalty). Let \(\Sigma_{\mathrm{MCTS}}\) and \(\Sigma_{\mathrm{greedy}}\) be two cognitive processors with identical \(P_{\mathrm{in}}, \omega, \mathcal{E}_{\Sigma}, \Gamma, \Phi, \Psi\) and identical proposer LLM, differing only in their action-selection rule: \(\Sigma_{\mathrm{MCTS}}\) runs MCTS with budget \(b > 0\) and bounded-variance leaf evaluator; \(\Sigma_{\mathrm{greedy}}\) commits the single highest-likelihood action of one LLM call. Suppose the policy’s per-action delusion divergence relative to the true posterior over outcomes satisfies \(\mathcal{D}_{KL}^{\mathrm{policy}} \ge \Delta > 0\). Then there exists a critical \(\Delta^{\star}\) such that for \(\Delta > \Delta^{\star}\), \[ \tau_d^{\mathrm{greedy}} \;\le\; \tau_d^{\mathrm{MCTS}} \cdot e^{-(\Delta - \Delta^{\star})}, \tag{5.3} \] i.e. the greedy processor’s expected dissolution time is exponentially shorter in the policy’s delusion.

Proof sketch. Both processors share the FPE denominator factor \(1+\mathcal{D}_{KL}^{(\Sigma)}+\Gamma^{(\Sigma)}\). The MCTS processor modifies its effective \(\mathcal{D}_{KL}\) by the search procedure: by exploring siblings of the most-likely action, it discovers and avoids high-\(\mathcal{D}_{KL}\) branches before committing to them. Quantitatively, conditional on the leaf evaluator being unbiased with variance \(\sigma^2\), the post-search expected KL of the committed action’s task family is reduced by at least \(\Delta - O(\sqrt{\sigma^2/b})\) relative to the proposer’s prior (this is the classical Bayesian argument: a search picks the better of \(K\) samples; the expected gap shrinks as \(\Delta - 1/\sqrt{b}\)). Substituting into the lifetime bound (5.1) of [Hyyrynen, 2026a] and using Theorem 5.2 of that paper yields (5.3) with \(\Delta^{\star} \propto \sqrt{\sigma^2/b}\). \(\square\)

The interpretation is sharp. A processor whose policy is well-calibrated (\(\Delta < \Delta^{\star}\)) gains nothing from MCTS — it should run greedy and save the budget. A processor whose policy is badly calibrated (\(\Delta \gg \Delta^{\star}\)) loses lifetime exponentially in the gap if it runs greedy. In this sense MCTS is a delusion shock absorber: it converts compute spent on search into delusion divergence reduced before commitment, paying for itself in lifetime when the policy needs it most.

5.3 Fractal credit consistency

A backpropagated \(\Delta\mathcal{R}\) at the search root is, by (4.2), a level-\(L\) persistence delta. The FPE is recursive: \(\mathcal{R}^{(L)}\) depends on \(\mathcal{R}^{(L\pm 1)}\) via \(\Phi\) and \(\Psi\). A coherent algorithm must therefore credit subtask actions in a way consistent with the level-\((L-1)\) accounting and with the level-\((L+1)\) envelope.

Theorem 5.3 (Fractal credit). Let \(a^{\star}\) be the action committed by MCTS at level \(L\) from root state \(s_0\), and let \(\Delta\mathcal{R}^{(L)}(s_{a^{\star}})\) be the (estimated) backpropagated value at the root. Then \(\Delta\mathcal{R}^{(L)}(s_{a^{\star}})\) is consistent with the FPE composition law in the sense that \[ \Delta\mathcal{R}^{(L)}(s_{a^{\star}}) \;\le\; \Psi\bigl(\mathcal{R}^{(L+1)}\bigr)\cdot \frac{\Delta P_{\mathrm{in}}\cdot \eta_I + P_{\mathrm{in}}\cdot\Delta\eta_I}{\omega\,\mathcal{E}_{\Sigma}\,(1+\mathcal{D}_{KL}+\Gamma)}\cdot\Phi + O(\delta^2), \tag{5.4} \] where \(\delta\) collects all linearisation errors at the \(\Sigma\)-state \(s_0\), and the right-hand side is the linearisation of (1.1) at \(s_0\). In particular, when the level-\((L+1)\) shelter dissolves (\(\Psi\to 0\)) or any structurally critical level-\((L-1)\) child fails (\(\Phi\to 0\)), the search’s committed action delivers \(\Delta\mathcal{R}^{(L)}\to 0\) regardless of within-tree statistics.

Proof sketch. Linearise (1.1) around \(s_0\) in the small parameters \(\Delta\bullet\) for each FPE term. Substitute the per-action contributions: \(\Delta\omega\) from new tasks/instances created by \(a^{\star}\), \(\Delta\mathcal{E}_{\Sigma}\) from observed tool errors during evaluation, \(\Delta P_{\mathrm{in}}\) from terminal corrections (§4.4), \(\Delta\eta_I\) from market priors (3.6). The factor \(\Psi\) in (5.4) is unchanged by within-\(L\) search (the deployment does not respond on the search timescale); \(\Phi\) enters multiplicatively because critical-child failure invalidates blanket factorisation per (5.3) of [Hyyrynen, 2026a]. The remainder \(O(\delta^2)\) is the second-order error of linearisation. \(\square\)

Theorem 5.3 is a consistency statement, not a convergence one. It says the algorithm of §4, when its leaf evaluator (3.11) is computed honestly, returns a \(\Delta\mathcal{R}\) that respects the level recursion: the search cannot, even in principle, produce action values that contradict the FPE composition law from [Hyyrynen, 2026a, Theorem 5.3]. Practically, this rules out a class of failure modes where a within-task search celebrates an action that — by isolating one subtree — masks the collapse of a critical sibling. The factor \(\Phi\) in (5.4) drags the search value to zero in exactly that case.

5.4 Necessary conditions

Together the three theorems characterise when MCTS-on-persistence is the right tool:

A deployment satisfying (C1)–(C4) gets, on average, exponential lifetime gains from MCTS. A deployment that violates any of them gets either no gain or a net loss; in particular, (C2) has been documented empirically as the primary failure mode of self-supervised value heads [Christiano et al., 2017, on the more general phenomenon of reward hacking].


6 Reference implementation: aion-core

The construction of §§2–5 is concrete. We describe its instantiation in the open-source aion-core reference stack [Hyyrynen, 2026b], where each MCTS role is served by an existing service and the algorithm of §4 ships as an opt-in path through the loop engine. The mapping is summarised below; full HTTP surfaces and source-tree pointers are in the architecture documents MCTS.md and persistence_objective_and_mcts.md.

6.1 Component mapping

Construction aion-core realisation HTTP / file pointer
Task store \(\mathcal{P}\) processor service, SQLite tasks table aion-core/processor/store.py, GET /tasks/{id}
Environment \(\mathcal{M}\) machine service under DATA_ROOT POST /read_file, POST /exec, POST /verify_working_dir
Policy \(\mathcal{L}\) loop service LoopEngine.take_a_step aion-core/loop/loop_engine.py
Evaluator \(\mathcal{Q}\) prediction_market service POST /v1/predictions/completions, market rows
Persistence accountant processes/persistence module (shipped) PersistenceState rows, sweeper, GET /persistence/leaderboard, dev.processes.persistence
Context router (MoE analogue) recommendation_engine service /rec surfaces; D_KL weight sync from persistence sweeper
LLM cost ledger (\(\omega\) proxy) node_treasury in loop (shipped) Profile input_mtrust_per_mtok / output_mtrust_per_mtok; debit per llm.call
Task-admission heuristic (v1) scheduler_expected_gain.py (shipped) Processor most_expected_gain; loop task_dispatch_ranker.py
MCTS scratch state Task.state._mcts YAML/structured blob POST /tasks/current/merge_state
Sandbox rollouts (v3) Not implemented POST /sandbox/fork, DELETE /sandbox/{id} (planned on machine)

The policy \(\mathcal{L}\) today commits a single tool call per turn; the \(K\)-sample proposer of §4.3 lives behind an AgentTypeConfig.mcts_k flag, defaulting to \(K=1\) (vanilla single-shot, identical to the current production behaviour).

6.2 The persistence accountant

The persistence accountant is implemented inside the processes service (not a standalone HTTP service). It ingests processor webhooks, recomputes (3.11) per node \(L\), and exposes the result on the merged loop state under dev.processes.persistence. The data model is the PersistenceState row of [aion-core, persistence_objective_and_mcts.md, §3.1]: per-node-id snapshots with all FPE terms and the resulting \(\hat{\mathcal{R}}^{(L)}\). Recompute is triggered (a) on terminal task events via the same webhook that the prediction market consumes, and (b) on a sweep timer every \(\tau_{\mathrm{sweep}}\) minutes for deployment-level recompute. The leaderboard route surfaces the worst-\(\mathcal{R}\) nodes for the operator console (#/kpi).

The accountant does not mutate the processor or machine. It is a pure observer of cross-service signals, in the same pattern as the prediction market. This keeps the FPE accounting causally downstream of the runtime: actions produce signals; signals produce \(\hat{\mathcal{R}}\); \(\hat{\mathcal{R}}\) informs subsequent actions through the admission heuristic (§6.3) and MCTS (§6.4). The cycle closes only through the policy and scheduler, never through accounting feedback that bypasses measured beliefs.

6.3 Task admission: expected-gain scheduling (v1)

Definition 2.2 asks which READY task handler \(h\) should claim. Full MCTS over the task tree (§4) is the asymptotic answer; production ships a cheaper v1 heuristic that runs on every claim when mcts_k == 1.

When a loop worker polls GET /tasks/current or slot-driven dispatch ranks candidates, the processor’s default most_expected_gain policy sorts eligible READY tasks by a scalar expected gain implemented in aion-core/scheduler_expected_gain.py and shared with loop/task_dispatch_ranker.py:

\[ \text{gain}(T,h) \;\approx\; \underbrace{\hat\eta_T \cdot \text{payoff}(T)}_{\text{market + urgency + backlog}} \;-\; \underbrace{w_\omega \cdot \log\!\bigl(1 + \widehat{\text{cost}}_{\mathrm{mT}}(T,h)\bigr)}_{\text{LLM profile pricing}}, \tag{6.1} \]

where:

This is not yet the full \(\mathbb{E}[\Delta\mathcal{R}^{(\Sigma)}]\) of (2.2) — it is a tractable proxy that folds in the \(\eta\) axis (markets) and the \(\omega\) axis (LLM spend) without running search. Operator priority sliders inject pseudo-count mass on SUCCESS, shifting \(\hat\eta\) without contaminating predictor KL scores ([aion-core, core_tasks_priority.md, §2]).

What v1 does not solve: choosing which agent type or LLM profile should own a newly created task. That remains target_agent_type at task creation; ranked surface:handler in the recommendation engine is indexed but not yet the primary spawn path ([aion-core, recommendation_engine.md, §9]).

6.4 The MCTS path through the loop

When an agent type has mcts_k > 1 and a non-zero search budget, LoopEngine.take_a_step follows §4 verbatim:

  1. Read the merged state and snapshot it as the search root \(r\).
  2. Until budget (4.8) is exhausted: select via (4.3); expand by sampling \(K\) candidates (4.5); evaluate via the cheapest available tier of §4.4; backpropagate via (4.7).
  3. Commit by (4.9); record the search statistics in Task.state._mcts.
  4. Dispatch the committed action as a normal HTTP tool call; the rest of the loop (state diff, market trigger) is unchanged.

When mcts_k == 1 and budget is zero, the loop reduces exactly to the current production behaviour: one LLM call, one tool dispatch, no search overhead. This zero-cost fallback is the architectural commitment that lets the construction ship without breaking existing deployments.

The persisted scratch state lives under Task.state._mcts as a YAML structure. We adopt YAML rather than JSON throughout the persistence interfaces of aion-core to keep configuration files, processor state patches, and cross-service payloads in one declarative dialect; the conversion to in-memory dictionaries is symmetric, but YAML’s tolerance for comments, multiline strings, and trailing commas makes operator inspection of search traces materially easier. A representative scratch payload, after a single MCTS pass with \(K=3\) and tier-1 evaluator, has the shape:

mcts:
  run_id: 2026-05-21T07-22-11Z-h3f8
  budget:
    cpu_seconds_used: 1.84
    cpu_seconds_cap: 4.0
    tokens_used: 1180
    tokens_cap: 8000
  root_state_hash: sha256:9c4d...e1
  evaluator_tier: market_prior
  nodes:
    root:
      visits: 7
      total_dR: 1.42
      children: [n1, n2, n3]
    n1:
      action:
        tool: processor_create_task
        args:
          title: "summarise inbox"
          parent: current
      visits: 4
      total_dR: 0.93
      Q: 0.2325
    n2:
      action:
        tool: processor_complete_current
        args:
          summary: "no actionable inbox items"
      visits: 2
      total_dR: 0.31
      Q: 0.155
    n3:
      action:
        tool: machine_exec
        args:
          cmd: "rg -l 'TODO' src/"
      visits: 1
      total_dR: 0.18
      Q: 0.18
  selected_action: n1
  committed_at: 2026-05-21T07-22-13Z

The shape above is the search trace, not the action itself; only selected_action is dispatched as a real tool call. The remaining nodes are evidence of what was considered and rejected, useful for post-hoc audit, regret analysis, and resumed search after a BLOCKED transition. Operators reading the trace get a one-glance answer to the question that the greedy baseline cannot answer at all: why this action and not another?

6.5 Deployment ladder

The deployment ladder of [aion-core, persistence_objective_and_mcts.md, §7] aligns with the necessary conditions of §5.4:

Each rung is independently deployable and instrumented to detect violations of the corresponding condition: v1 surfaces \(\Phi\) collapse, v2 measures empirical regret of MCTS vs greedy on identical workloads, v3 measures sandbox–reality gap.

6.6 What is not implemented

Three pieces remain unimplemented in the current reference stack and constitute the v3-and-beyond engineering surface:

  1. Sandbox semantics. The machine has no copy-on-write fork primitive. Real rollout requires either a filesystem-level snapshot (e.g. ZFS, OverlayFS) or a process-level sandbox (e.g. Firecracker, Bubblewrap) with signed teardown to ensure stale state is not leaked back into the production data root.
  2. Persistence priors. The \(\Pi(c)\) term in (4.3) requires an action-metadata mapping that today exists only as ad-hoc heuristics. A v2 contribution is to formalise this as a PolicyTypeConfig.persistence_prior YAML schema and to learn its weights from past collapse traces.
  3. Self-correcting \(\hat{\mathcal{D}}_{KL}\). The estimator (3.6) treats market-final consensus as ground truth. When markets are themselves systematically biased — operator-driven SUCCESS calls in low-engagement deployments, for instance — the estimator under-reports delusion. A v3 fix is to weight markets by their cumulative KL margin against external benchmarks, in the spirit of the proof-of-trust ledger of [Hyyrynen, 2026c].

These gaps do not invalidate Theorems 5.1–5.3 — the theorems hold as conditional statements — but they bound which deployments can claim the conditions are met.


7 Predictions, limitations, and discussion

7.1 Falsifiable predictions

The construction yields four predictions that distinguish it both from generic “agentic LLM” stacks (which lack the persistence accounting) and from generic MCTS work (which lacks the FPE tie).

(P1) Exponential lifetime gap from MCTS. Two deployments with matched workload and matched proposer LLM but different mcts_k should show, by Theorem 5.2, an exponential separation in expected dissolution time as a function of policy delusion \(\Delta\). Concretely: the median time-to-operator-disable should scale as \(e^{(\Delta-\Delta^{\star})}\) in the gap. This is testable with A/B deployments on the same operator base; the persistence service’s leaderboard provides the dependent variable.

(P2) Negligible advantage when calibrated. Conversely, deployments whose proposer LLM is well-calibrated (low \(\hat{\mathcal{D}}_{KL}^{(L)}\) in the v1 accounting) should show no MCTS advantage, by (C1). This is a stringent test: many “MCTS helps everywhere” claims in the literature are confounded by miscalibration, and the FPE framing predicts the advantage vanishes in the calibration limit.

(P3) Phase transitions at \(\Phi\to 0\). Theorem 5.3 predicts a discontinuous drop in MCTS root value when a structurally critical sub-IPS fails — operators should observe the search abruptly devaluing actions whose subtree depends on a recently-collapsed child, even before the failure is obvious in raw success rates. This matches the “percolation-like” failure mode of [Csete & Doyle, 2004] adapted to the agent stack.

(P4) Sub-Landauer impossibility. No cognitive processor can persist below the floor \(\omega\,\mathcal{E}_\Sigma\) regardless of \(\eta_I\) improvements from MCTS. In particular, MCTS cannot save a deployment whose noise floor exceeds its income; the search will recommend NOOP-like actions and the operator will eventually disconnect. This is testable by stress-testing \(\mathcal{E}_\Sigma\) (induced tool errors) and observing the actual MCTS-committed action distribution narrow toward inaction.

7.2 Limitations

Four limitations bound the construction’s scope.

(L1) Irreversibility of side effects. The simulation step (§4.4) tier 3 promises sandbox rollouts that “discard the branch”, but no real-world side effect is fully reversible: an external API call observed by a third party cannot be unsent, a written file may have been read by a sibling process, an LLM call’s billing event has happened. Sandbox rollouts therefore approximate; the gap between sandbox-\(\hat{\mathcal{R}}\) and production-\(\hat{\mathcal{R}}\) is itself a noise term that erodes Theorem 5.1’s regret bound. Practical deployments must restrict tier-3 rollouts to actions with bounded externalisation cost.

(L2) Proposer–budget asymmetry. Theorem 5.2 assumes the proposer LLM and the MCTS evaluator share the same delusion \(\Delta\). In practice, a cheap proposer paired with an expensive evaluator can have opposite biases: the proposer is over-confident, the evaluator over-cautious. The composition is not necessarily better than either alone; (C2) requires honest leaf evaluation, and an over-cautious evaluator scaled by an over-confident proposer can produce systematic mode-collapse in search.

(L3) Persistence-of-the-deluded. The accounting (3.11) is computed by the cognitive processor’s own internal model. A processor whose model is itself deluded — over-counting \(P_{\mathrm{in}}\), under-counting \(\omega\) — will report \(\hat{\mathcal{R}}^{(\Sigma)}\) values that flatter its own decisions. MCTS optimising this proxy makes the delusion more efficient, not the persistence more secure. The architectural reply (cf. §3) is to ground critical FPE terms in external attestation: prediction-market scores from independent predictors (for \(\mathcal{D}_{KL}\)), operator value tags (for \(P_{\mathrm{in}}\)), audit-log error counts (for \(\mathcal{E}_\Sigma\)). Where external grounding is unavailable, the construction is structurally vulnerable to self-flattering optimisation, and operators should treat \(\hat{\mathcal{R}}^{(\Sigma)}\) accordingly.

(L4) Theorem 5.1 is asymptotic. The convergence rate \(O(\sqrt{\ln n / n})\) is slow; small \(n\) deployments may see no measurable gain. The remedy is the persistence prior \(\Pi(c)\) (4.4), which initialises action values usefully and shifts effective \(n\) upward, but only when the prior is itself honest — circular dependency on (L3).

7.3 Relation to existing work

The construction unifies four threads that have until now been pursued largely independently.

What we believe is novel is the explicit composition: an LLM-driven runtime whose action-selection objective is its own thermodynamic persistence, computed from runtime signals, optimised by MCTS, with provable convergence under stated conditions. Each piece exists in prior work; the whole exists, to our knowledge, only as the design described here and partially shipped in aion-core.

7.4 Boundary with non-physical claims

We close §7 with what the construction does not claim.


8 Conclusion

A cognitive processor is an information-persisting system in the same sense as a cell, a firm, or a nucleus: its identity is preserved across time only insofar as its books balance in the dimensionless currency of the persistence ratio (1.1). When the policy of such a processor is implemented as a large language model, the action-selection problem becomes the question of how to spend a finite computation budget so that the expected change in the processor’s own \(\mathcal{R}^{(\Sigma)}\) is maximised. We have shown that Monte Carlo Tree Search over the processor’s task tree, with the LLM as proposer and the prediction market’s KL reduction score as the empirical delusion estimate, is the structurally correct way to spend that budget. Theorem 5.1 gives the convergence rate; Theorem 5.2 gives the lifetime gap relative to greedy single-rollout; Theorem 5.3 gives the consistency with the FPE composition law across levels.

The construction is concrete: each MCTS phase maps to existing services in the aion-core reference stack, each FPE term maps to an observable runtime signal, and the deployment ladder respects (C1)–(C4) at successive levels of cost. It is also limited: irreversibility, proposer–budget asymmetry, the danger of optimising a self-flattering proxy, and the asymptotic nature of the convergence rate all bound the scope of the claimed gains. The honest cognitive processor is the one that audits its own ledger against external attestation and treats its tree search as a delusion shock absorber, not as a self-validating oracle.

In the framing of [Hyyrynen, 2026a], the universe is full of patterns; those that stay are exactly those whose books balance in the currency of bits. A cognitive processor is a pattern we have written down. Its books are auditable because we keep them; its persistence is decidable because we measure it. Whether any particular deployment persists in a given environment depends on whether (1.1) holds for the signals it actually has access to — and that, in practice, is the work of operators and engineers, one task at a time. MCTS over persistence is the algorithm that turns that work into a budget allocation problem; the FPE is the law that says when the allocation is honest.


Acknowledgements

The author thanks the open-source authors of the MCTS, free-energy-principle, and stochastic-thermodynamics literatures for materials freely available throughout the development of this work, and several anonymous correspondents for objections to earlier drafts. The reference implementation builds on contributions from the aion-core project, whose architecture documents are cited above.

Author contribution

LH conceived and wrote the manuscript and the reference implementation with the help of LLM based AI systems.

Declarations

The author declares no competing interests. No external funding supported this work.


References