Aion Reader → GitLab
L1–L2 · Aion LLM

FractalLM & FractalGPT

A minimal, hackable GPT training harness extended with persistence-aware loss: every token gets a per-token ratio \(\mathcal{R}_t\) that reweights the gradient — mastered patterns protect themselves, under-predicted ones are pushed hard. Chat completions expose usage.persistence for downstream bridges.

01

What it is

A fork of the nanochat lineage: minimal, single-file training stages, one complexity dial (--depth) that auto-derives all hyperparameters for compute-optimal training. On top of the standard GPT trunk sit two extensions:

  • FractalGPT — the base transformer with smear-gate context propagation and FractalMoE layers (dense-compute + sparse gating; substrate isolation via zero-gradient on inactive experts)
  • FractalLM — FractalGPT subclass that computes per-token \(\mathcal{R}_t\) on every forward pass and threads it through fractal_loss during training and into usage.persistence during inference
  • Training stages — tok_train (BPE tokenizer) → base_train (pretraining) → chat_sft (SFT) → chat_rl (RL). One shell script, one GPU node, ~$50
  • OpenAI-compatible inference — chat_web serves a /v1/chat/completions endpoint; usage.persistence in the response carries the per-sequence \(\mathcal{R}\) summary used by aion-core's blockchain bridge
02

Theory: persistence in FractalMoE

The central anti-forgetting hypothesis is architectural. A neural network is an information-persisting system: once part of it has learned useful structure, that structure should tend to stay alive unless new evidence forces a correction. Dense transformers do this poorly because the same shared interior is reused for everything.

\[ \mathcal{R}^{(\text{node})} = \Psi(\mathcal{R}^{(\text{parent})}) \cdot \frac{P_{in}\,\eta}{1 + \mathcal{D}_{KL} + \Gamma} \cdot \Phi(\mathcal{R}^{(\text{children})}) \]

FractalGPT addresses this by turning the transformer into a FractalMoE graph: sparse, semi-isolated experts arranged so that learning can happen locally instead of globally. Each expert is treated as a small persistence node. If that node has already learned useful information, the architecture should let it keep that information instead of having unrelated tasks rewrite it.

Mechanism Role in the hypothesis
Sparse expert routing Only part of the graph is active for each example, so unrelated experts are not continually rewritten.
Middle-layer concentration Most adaptation is hypothesized to happen where representations are recombined and routed between experts.
Node persistence score Measures whether an active node is preserving useful learned structure instead of dissolving it under new updates.

The claim is therefore not mainly about named gradient terms. It is that FractalMoE makes forgetting structurally harder: learning pressure concentrates in the middle layers, while the persistence of each node protects already-learned information content from global overwrite. FractalLM and fractal_loss are the training and measurement machinery wrapped around that architectural idea. This is what is tested in experiments/persistence_forgetting.md.

Academic papers
Same trade-off at organisation scale (L3). Inside the model, FractalMoE sparsifies which experts activate on each token. In aion-core, the recommendation engine and prediction markets sparsify what enters the prompt — ranked tools, processes, and files instead of full catalogs, with KL-scored markets tuning whether those shortlists stay calibrated. Weight compression here; context routing there.
03

Current state

GPT-2 speedrun leaderboard — wall-clock time to beat CORE score 0.256525 on 8×H100:

# Time val_bpb CORE Description
0168 h0.2565Original GPT-2 (OpenAI 2019)
32.76 h0.746450.2602batch size 1M tokens
51.80 h0.718080.2690autoresearch round 1
61.65 h0.718000.2626autoresearch round 2 ← current

Total cost ~$48 on an 8×H100 spot node at $3/GPU/hr. autoresearch improvements are ongoing — see autoresearch/.

TrueFractalMoE d12 — completed Jun 10 2026

First full run of the TrueFractalMoE architecture (shifted_mesh wiring, 4 experts, 2 active) on a single RTX A4000. 80,910 training steps over 1 epoch, completing in ~21 hours of GPU time.

Metric Start End (step 80,910) Δ
Loss10.563.08-71%
Validation bpb3.1730.904-71%
CORE metric0.0660.128+94%
Throughput17,345 tok/sec @ 8.8% MFU (SDPA, RTX A4000)

Single GPU, ~$7.24 total. Analysis in autoresearch/eta-analysis-2026-06-08.md.

Dense d12 Baseline — completed Jun 11 2026

Dense (no MoE) d12 trained on the same RTX A4000, same dataset, same batch size. 80,640 steps over 1 epoch in ~17 hours. Lean evals (eval_every=2000, core_metric_every=10000) cut wall time from 85h to 20h.

Metric Dense d12 TrueFractalMoE d12 Δ
Loss2.863.08-0.22 (dense wins)
Validation bpb0.8760.904-0.028 (dense wins)
CORE metric0.1360.128+0.008 (dense wins)
Throughput22,000 tok/s17,345 tok/s+27%
MFU11.3%8.8%+2.5pp
Peak VRAM8.6 GB13.6 GB-4.0 GB
Wall time~20h~85h-76%
Cost$1.69$7.24-77%

MoE overhead at d12 scale hurts rather than helps — dense wins on all metrics. Suggests MoE benefits require larger models where param-data ratio matters.

04

Quick start

Install
cd aion-llm
uv sync --extra gpu   # CUDA A100/H100
uv sync --extra cpu   # or CPU/MPS
source .venv/bin/activate
Train GPT-2 grade model & talk to it (~$50)
bash runs/speedrun.sh   # ~1.7h on 8×H100
python -m scripts.chat_web  # OpenAI-compat UI
Training dashboard (browser)
cd aion-llm/webui/ui && npm install && npm run build && cd -
python -m scripts.webui --port 8790  # live training monitor
Single depth sweep (research)
torchrun --standalone --nproc_per_node=8 \
  -m scripts.base_train -- --depth=12 --run=d12
05

Training pipeline & FractalLM inference

tok_train

Train a BPE tokenizer on your data. GPT-4-style byte fallback, configurable vocabulary size.

base_train

Pretraining with Muon optimizer, FP8 support, DDP. Single --depth dial sets all other hyperparams.

chat_sft

Supervised fine-tuning on conversation data (SmolTalk, custom JSONL). This is the first shipped distillation backend for exported aion-core corpora.

chat_rl

Reinforcement learning stage. Evaluation tasks: ARC, MMLU, GSM8K, HumanEval, SpellingBee.

FractalLM inference — usage.persistence

When serving a FractalLM checkpoint (--model-type=fractal), the /v1/chat/completions response includes a persistence field inside usage carrying per-sequence \(\mathcal{R}\) statistics. aion-core's blockchain bridge maps these to PoT vote probabilities.

"usage": {
  "prompt_tokens": 42, "completion_tokens": 18,
  "persistence": { "R_mean": 1.24, "R_min": 0.87, "R_max": 3.1 }
}
06

Use it in your own project

Drop-in OpenAI-compatible server

Any existing code that speaks the OpenAI chat API works unchanged against aion-llm. Point it at http://localhost:8000/v1 and the usage.persistence field appears automatically when using a FractalLM checkpoint.

python -m scripts.chat_web --port 8000 --model-type fractal \
  --checkpoint path/to/model.pt
Connect to aion-core

Set two env vars in aion-core/.env to point the Loop at your FractalLM instance (or any Ollama/vLLM endpoint):

AION_LLM_BASEURL=http://localhost:8000/v1
AION_LLM_MODEL_NAME=fractal-d26
Add your own training data

SFT accepts any JSONL file of conversations via customjson.py. For identity fine-tuning (personality, specialised knowledge), generate synthetic data with dev/gen_synthetic_data.py and mix it into the SFT stage. The distillation path from aion-core now exports `train.jsonl`, `val.jsonl`, and `replay.jsonl` directly into this interface.

Run the forgetting experiment

Reproduce the ACC/BWT benchmark comparing fractal loss vs vanilla CE vs EWC on synthetic continual-learning domains:

# CPU smoke test (~60s)
python -m scripts.persistence_forgetting --quick --device cpu
# Regime B — actual hypothesis test
python -m scripts.persistence_forgetting --replay-frac 0.25
07

Information stored

At this layer the model does not form most of the useful information directly. The higher layers do that first: aion-core and aion-blockchain interact with reality, test forecasts against outcomes, and accumulate validated traces of what worked. aion-llm is the compression layer that absorbs that formed information into adapters, fine-tuned checkpoints, or full-model weights.

What What it approximates Storage medium
Model weights (.pt) The probability distribution over language tokens — the most compressed description of "what the world says" ~/.cache/aion-llm/{base,chatsft,chatrl}_checkpoints/<run>/
BPE tokenizer vocabulary The compression alphabet — which byte sequences are recurring enough to deserve a single symbol .vocab / .merges alongside checkpoint directory
Training event log The learning trajectory — how far the approximation has moved toward the data distribution at each step ~/.cache/aion-llm/runs/<run_id>/events.jsonl
Training corpora / custom datasets Structured traces exported from higher layers — task histories, norm outcomes, reviews, and domain text to be distilled into the model. The current bridge emits `train`, `val`, and `replay` corpora. JSONL files in the web-UI dataset store
LoRA / finetune adapters Targeted compression of newly formed information without rewriting the whole base model. Today this is shipped as SFT/full-checkpoint distillation; true LoRA is the planned next backend. Checkpoint artifacts produced by LoRA, SFT, or full retraining runs
usage.persistence (inference) Per-token \(\mathcal{R}_t\): how well the current weights already encode the distilled approximation JSON response field; not stored locally — forwarded to aion-core

Delusion (\(\mathcal{D}_{KL}\)) when stale. The weights lag reality because they only see reality indirectly, through the training traces exported from aion-core and aion-blockchain. If those traces are stale, biased, or too narrow, the model compresses the wrong approximation. The fractal loss partially mitigates this during continual training by protecting already-mastered patterns from overwrite while still letting new validated information enter through LoRA, fine-tuning, or full retraining.

What flows in and out. Upstream, every FractalLM chat completion carries usage.persistence (R_mean, R_min, R_var) to aion-core, where it is attached to task messages and eventually informs on-chain trust. Downstream, validated task histories, norm decisions, market outcomes, and conflict patches can be exported back into training corpora that update the model through adapters, fine-tuning, or full-model training.