Aion LLM — FractalLM & FractalGPT

01

What it is

A fork of the nanochat lineage: minimal, single-file training stages, one complexity dial (--depth) that auto-derives all hyperparameters for compute-optimal training. On top of the standard GPT trunk sit two extensions:

— FractalGPT — the base transformer with smear-gate context propagation and FractalMoE layers (dense-compute + sparse gating; substrate isolation via zero-gradient on inactive experts)
— FractalLM — FractalGPT subclass that computes per-token $\mathcal{R}_t$ on every forward pass and threads it through fractal_loss during training and into usage.persistence during inference
— Training stages — tok_train (BPE tokenizer) → base_train (pretraining) → chat_sft (SFT) → chat_rl (RL). One shell script, one GPU node, ~$50
— OpenAI-compatible inference — chat_web serves a /v1/chat/completions endpoint; usage.persistence in the response carries the per-sequence $\mathcal{R}$ summary used by aion-core's blockchain bridge

02

Theory: persistence in FractalMoE

The central anti-forgetting hypothesis is architectural. A neural network is an information-persisting system: once part of it has learned useful structure, that structure should tend to stay alive unless new evidence forces a correction. Dense transformers do this poorly because the same shared interior is reused for everything.

\[ \mathcal{R}^{(\text{node})} = \Psi(\mathcal{R}^{(\text{parent})}) \cdot \frac{P_{in}\,\eta}{1 + \mathcal{D}_{KL} + \Gamma} \cdot \Phi(\mathcal{R}^{(\text{children})}) \]

FractalGPT addresses this by turning the transformer into a FractalMoE graph: sparse, semi-isolated experts arranged so that learning can happen locally instead of globally. Each expert is treated as a small persistence node. If that node has already learned useful information, the architecture should let it keep that information instead of having unrelated tasks rewrite it.

Mechanism	Role in the hypothesis
Sparse expert routing	Only part of the graph is active for each example, so unrelated experts are not continually rewritten.
Middle-layer concentration	Most adaptation is hypothesized to happen where representations are recombined and routed between experts.
Node persistence score	Measures whether an active node is preserving useful learned structure instead of dissolving it under new updates.

The claim is therefore not mainly about named gradient terms. It is that FractalMoE makes forgetting structurally harder: learning pressure concentrates in the middle layers, while the persistence of each node protects already-learned information content from global overwrite. FractalLM and fractal_loss are the training and measurement machinery wrapped around that architectural idea. This is what is tested in experiments/persistence_forgetting.md.

Academic papers

→ Information-Persisting Systems — the FPE law that fractal_loss implements at L1
→ Persistence vs. Catastrophic Forgetting — ACC/BWT benchmark, EWC comparison, falsification criteria

Same trade-off at organisation scale (L3). Inside the model, FractalMoE sparsifies which experts activate on each token. In aion-core, the recommendation engine and prediction markets sparsify what enters the prompt — ranked tools, processes, and files instead of full catalogs, with KL-scored markets tuning whether those shortlists stay calibrated. Weight compression here; context routing there.

03

Current state

GPT-2 speedrun leaderboard — wall-clock time to beat CORE score 0.256525 on 8×H100:

#	Time	val_bpb	CORE	Description
0	168 h	—	0.2565	Original GPT-2 (OpenAI 2019)
3	2.76 h	0.74645	0.2602	batch size 1M tokens
5	1.80 h	0.71808	0.2690	autoresearch round 1
6	1.65 h	0.71800	0.2626	autoresearch round 2 ← current

Total cost ~$48 on an 8×H100 spot node at $3/GPU/hr. autoresearch improvements are ongoing — see autoresearch/.

TrueFractalMoE d12 — completed Jun 10 2026

First full run of the TrueFractalMoE architecture (shifted_mesh wiring, 4 experts, 2 active) on a single RTX A4000. 80,910 training steps over 1 epoch, completing in ~21 hours of GPU time.

Metric	Start	End (step 80,910)	Δ
Loss	10.56	3.08	-71%
Validation bpb	3.173	0.904	-71%
CORE metric	0.066	0.128	+94%
Throughput	17,345 tok/sec @ 8.8% MFU (SDPA, RTX A4000)

Single GPU, ~$7.24 total. Analysis in autoresearch/eta-analysis-2026-06-08.md.

Dense d12 Baseline — completed Jun 11 2026

Dense (no MoE) d12 trained on the same RTX A4000, same dataset, same batch size. 80,640 steps over 1 epoch in ~17 hours. Lean evals (eval_every=2000, core_metric_every=10000) cut wall time from 85h to 20h.

Metric	Dense d12	TrueFractalMoE d12	Δ
Loss	2.86	3.08	-0.22 (dense wins)
Validation bpb	0.876	0.904	-0.028 (dense wins)
CORE metric	0.136	0.128	+0.008 (dense wins)
Throughput	22,000 tok/s	17,345 tok/s	+27%
MFU	11.3%	8.8%	+2.5pp
Peak VRAM	8.6 GB	13.6 GB	-4.0 GB
Wall time	~20h	~85h	-76%
Cost	$1.69	$7.24	-77%

MoE overhead at d12 scale hurts rather than helps — dense wins on all metrics. Suggests MoE benefits require larger models where param-data ratio matters.

04

Quick start

Install

cd aion-llm
uv sync --extra gpu # CUDA A100/H100
uv sync --extra cpu # or CPU/MPS
source .venv/bin/activate

Train GPT-2 grade model & talk to it (~$50)

bash runs/speedrun.sh # ~1.7h on 8×H100
python -m scripts.chat_web # OpenAI-compat UI

Training dashboard (browser)

cd aion-llm/webui/ui && npm install && npm run build && cd -
python -m scripts.webui --port 8790 # live training monitor

Single depth sweep (research)

torchrun --standalone --nproc_per_node=8 \
-m scripts.base_train -- --depth=12 --run=d12

05

Training pipeline & FractalLM inference

tok_train

Train a BPE tokenizer on your data. GPT-4-style byte fallback, configurable vocabulary size.

base_train

Pretraining with Muon optimizer, FP8 support, DDP. Single --depth dial sets all other hyperparams.

chat_sft

Supervised fine-tuning on conversation data (SmolTalk, custom JSONL). This is the first shipped distillation backend for exported aion-core corpora.

chat_rl

Reinforcement learning stage. Evaluation tasks: ARC, MMLU, GSM8K, HumanEval, SpellingBee.

FractalLM inference — usage.persistence

When serving a FractalLM checkpoint (--model-type=fractal), the /v1/chat/completions response includes a persistence field inside usage carrying per-sequence $\mathcal{R}$ statistics. aion-core's blockchain bridge maps these to PoT vote probabilities.

"usage": {
"prompt_tokens": 42, "completion_tokens": 18,
"persistence": { "R_mean": 1.24, "R_min": 0.87, "R_max": 3.1 }
}

06

Use it in your own project

Drop-in OpenAI-compatible server

Any existing code that speaks the OpenAI chat API works unchanged against aion-llm. Point it at http://localhost:8000/v1 and the usage.persistence field appears automatically when using a FractalLM checkpoint.

python -m scripts.chat_web --port 8000 --model-type fractal \
--checkpoint path/to/model.pt

Connect to aion-core

Set two env vars in aion-core/.env to point the Loop at your FractalLM instance (or any Ollama/vLLM endpoint):

AION_LLM_BASEURL=http://localhost:8000/v1
AION_LLM_MODEL_NAME=fractal-d26

Add your own training data

SFT accepts any JSONL file of conversations via customjson.py. For identity fine-tuning (personality, specialised knowledge), generate synthetic data with dev/gen_synthetic_data.py and mix it into the SFT stage. The distillation path from aion-core now exports `train.jsonl`, `val.jsonl`, and `replay.jsonl` directly into this interface.

Run the forgetting experiment

Reproduce the ACC/BWT benchmark comparing fractal loss vs vanilla CE vs EWC on synthetic continual-learning domains:

# CPU smoke test (~60s)
python -m scripts.persistence_forgetting --quick --device cpu
# Regime B — actual hypothesis test
python -m scripts.persistence_forgetting --replay-frac 0.25

07

Information stored

At this layer the model does not form most of the useful information directly. The higher layers do that first: aion-core and aion-blockchain interact with reality, test forecasts against outcomes, and accumulate validated traces of what worked. aion-llm is the compression layer that absorbs that formed information into adapters, fine-tuned checkpoints, or full-model weights.

What	What it approximates	Storage medium
Model weights (.pt)	The probability distribution over language tokens — the most compressed description of "what the world says"	`~/.cache/aion-llm/{base,chatsft,chatrl}_checkpoints/<run>/`
BPE tokenizer vocabulary	The compression alphabet — which byte sequences are recurring enough to deserve a single symbol	.vocab / .merges alongside checkpoint directory
Training event log	The learning trajectory — how far the approximation has moved toward the data distribution at each step	`~/.cache/aion-llm/runs/<run_id>/events.jsonl`
Training corpora / custom datasets	Structured traces exported from higher layers — task histories, norm outcomes, reviews, and domain text to be distilled into the model. The current bridge emits `train`, `val`, and `replay` corpora.	JSONL files in the web-UI dataset store
LoRA / finetune adapters	Targeted compression of newly formed information without rewriting the whole base model. Today this is shipped as SFT/full-checkpoint distillation; true LoRA is the planned next backend.	Checkpoint artifacts produced by LoRA, SFT, or full retraining runs
usage.persistence (inference)	Per-token $\mathcal{R}_t$: how well the current weights already encode the distilled approximation	JSON response field; not stored locally — forwarded to aion-core

Delusion ($\mathcal{D}_{KL}$) when stale. The weights lag reality because they only see reality indirectly, through the training traces exported from aion-core and aion-blockchain. If those traces are stale, biased, or too narrow, the model compresses the wrong approximation. The fractal loss partially mitigates this during continual training by protecting already-mastered patterns from overwrite while still letting new validated information enter through LoRA, fine-tuning, or full retraining.

What flows in and out. Upstream, every FractalLM chat completion carries usage.persistence (R_mean, R_min, R_var) to aion-core, where it is attached to task messages and eventually informs on-chain trust. Downstream, validated task histories, norm decisions, market outcomes, and conflict patches can be exported back into training corpora that update the model through adapters, fine-tuning, or full-model training.