FractalLM & FractalGPT
A minimal, hackable GPT training harness extended with persistence-aware loss:
every token gets a per-token ratio \(\mathcal{R}_t\) that reweights the gradient — mastered
patterns protect themselves, under-predicted ones are pushed hard.
Chat completions expose usage.persistence
for downstream bridges.
What it is
A fork of the nanochat lineage: minimal, single-file training stages, one complexity
dial (--depth) that auto-derives all hyperparameters for compute-optimal training.
On top of the standard GPT trunk sit two extensions:
- —
FractalGPT — the base transformer with smear-gate context propagation and
FractalMoElayers (dense-compute + sparse gating; substrate isolation via zero-gradient on inactive experts) - —
FractalLM — FractalGPT subclass that computes per-token \(\mathcal{R}_t\) on every forward pass and threads it through
fractal_lossduring training and intousage.persistenceduring inference - — Training stages — tok_train (BPE tokenizer) → base_train (pretraining) → chat_sft (SFT) → chat_rl (RL). One shell script, one GPU node, ~$50
- —
OpenAI-compatible inference — chat_web serves a
/v1/chat/completionsendpoint;usage.persistencein the response carries the per-sequence \(\mathcal{R}\) summary used by aion-core's blockchain bridge
Theory: persistence in FractalMoE
The central anti-forgetting hypothesis is architectural. A neural network is an information-persisting system: once part of it has learned useful structure, that structure should tend to stay alive unless new evidence forces a correction. Dense transformers do this poorly because the same shared interior is reused for everything.
FractalGPT addresses this by turning the transformer into a FractalMoE graph: sparse, semi-isolated experts arranged so that learning can happen locally instead of globally. Each expert is treated as a small persistence node. If that node has already learned useful information, the architecture should let it keep that information instead of having unrelated tasks rewrite it.
| Mechanism | Role in the hypothesis |
|---|---|
| Sparse expert routing | Only part of the graph is active for each example, so unrelated experts are not continually rewritten. |
| Middle-layer concentration | Most adaptation is hypothesized to happen where representations are recombined and routed between experts. |
| Node persistence score | Measures whether an active node is preserving useful learned structure instead of dissolving it under new updates. |
The claim is therefore not mainly about named gradient terms. It is that FractalMoE makes
forgetting structurally harder: learning pressure concentrates in the middle layers, while
the persistence of each node protects already-learned information content from global overwrite.
FractalLM and fractal_loss are the
training and measurement machinery wrapped around that architectural idea. This is what is tested in
experiments/persistence_forgetting.md.
- → Information-Persisting Systems — the FPE law that fractal_loss implements at L1
- → Persistence vs. Catastrophic Forgetting — ACC/BWT benchmark, EWC comparison, falsification criteria
Current state
GPT-2 speedrun leaderboard — wall-clock time to beat CORE score 0.256525 on 8×H100:
| # | Time | val_bpb | CORE | Description |
|---|---|---|---|---|
| 0 | 168 h | — | 0.2565 | Original GPT-2 (OpenAI 2019) |
| 3 | 2.76 h | 0.74645 | 0.2602 | batch size 1M tokens |
| 5 | 1.80 h | 0.71808 | 0.2690 | autoresearch round 1 |
| 6 | 1.65 h | 0.71800 | 0.2626 | autoresearch round 2 ← current |
Total cost ~$48 on an 8×H100 spot node at $3/GPU/hr. autoresearch improvements are ongoing — see autoresearch/.
First full run of the TrueFractalMoE architecture (shifted_mesh wiring, 4 experts, 2 active) on a single RTX A4000. 80,910 training steps over 1 epoch, completing in ~21 hours of GPU time.
| Metric | Start | End (step 80,910) | Δ |
|---|---|---|---|
| Loss | 10.56 | 3.08 | -71% |
| Validation bpb | 3.173 | 0.904 | -71% |
| CORE metric | 0.066 | 0.128 | +94% |
| Throughput | 17,345 tok/sec @ 8.8% MFU (SDPA, RTX A4000) | ||
Single GPU, ~$7.24 total. Analysis in autoresearch/eta-analysis-2026-06-08.md.
Dense (no MoE) d12 trained on the same RTX A4000, same dataset, same batch size. 80,640 steps over 1 epoch in ~17 hours. Lean evals (eval_every=2000, core_metric_every=10000) cut wall time from 85h to 20h.
| Metric | Dense d12 | TrueFractalMoE d12 | Δ |
|---|---|---|---|
| Loss | 2.86 | 3.08 | -0.22 (dense wins) |
| Validation bpb | 0.876 | 0.904 | -0.028 (dense wins) |
| CORE metric | 0.136 | 0.128 | +0.008 (dense wins) |
| Throughput | 22,000 tok/s | 17,345 tok/s | +27% |
| MFU | 11.3% | 8.8% | +2.5pp |
| Peak VRAM | 8.6 GB | 13.6 GB | -4.0 GB |
| Wall time | ~20h | ~85h | -76% |
| Cost | $1.69 | $7.24 | -77% |
MoE overhead at d12 scale hurts rather than helps — dense wins on all metrics. Suggests MoE benefits require larger models where param-data ratio matters.
Quick start
uv sync --extra gpu # CUDA A100/H100
uv sync --extra cpu # or CPU/MPS
source .venv/bin/activate
python -m scripts.chat_web # OpenAI-compat UI
python -m scripts.webui --port 8790 # live training monitor
-m scripts.base_train -- --depth=12 --run=d12
Training pipeline & FractalLM inference
tok_train
Train a BPE tokenizer on your data. GPT-4-style byte fallback, configurable vocabulary size.
base_train
Pretraining with Muon optimizer, FP8 support, DDP. Single --depth dial sets all other hyperparams.
chat_sft
Supervised fine-tuning on conversation data (SmolTalk, custom JSONL). This is the first shipped distillation backend for exported aion-core corpora.
chat_rl
Reinforcement learning stage. Evaluation tasks: ARC, MMLU, GSM8K, HumanEval, SpellingBee.
usage.persistence
When serving a FractalLM checkpoint (--model-type=fractal),
the /v1/chat/completions response includes
a persistence field inside usage
carrying per-sequence \(\mathcal{R}\) statistics. aion-core's blockchain bridge maps
these to PoT vote probabilities.
"prompt_tokens": 42, "completion_tokens": 18,
"persistence": { "R_mean": 1.24, "R_min": 0.87, "R_max": 3.1 }
}
Use it in your own project
Any existing code that speaks the OpenAI chat API works unchanged against aion-llm.
Point it at http://localhost:8000/v1 and
the usage.persistence field appears automatically when using a FractalLM checkpoint.
--checkpoint path/to/model.pt
Set two env vars in aion-core/.env to
point the Loop at your FractalLM instance (or any Ollama/vLLM endpoint):
AION_LLM_MODEL_NAME=fractal-d26
SFT accepts any JSONL file of conversations via customjson.py.
For identity fine-tuning (personality, specialised knowledge), generate synthetic data with
dev/gen_synthetic_data.py and mix it into the SFT stage. The
distillation path from aion-core now exports `train.jsonl`, `val.jsonl`, and `replay.jsonl`
directly into this interface.
Reproduce the ACC/BWT benchmark comparing fractal loss vs vanilla CE vs EWC on synthetic continual-learning domains:
python -m scripts.persistence_forgetting --quick --device cpu
# Regime B — actual hypothesis test
python -m scripts.persistence_forgetting --replay-frac 0.25
Information stored
At this layer the model does not form most of the useful information directly. The higher layers do that first: aion-core and aion-blockchain interact with reality, test forecasts against outcomes, and accumulate validated traces of what worked. aion-llm is the compression layer that absorbs that formed information into adapters, fine-tuned checkpoints, or full-model weights.
| What | What it approximates | Storage medium |
|---|---|---|
| Model weights (.pt) | The probability distribution over language tokens — the most compressed description of "what the world says" | ~/.cache/aion-llm/{base,chatsft,chatrl}_checkpoints/<run>/ |
| BPE tokenizer vocabulary | The compression alphabet — which byte sequences are recurring enough to deserve a single symbol | .vocab / .merges alongside checkpoint directory |
| Training event log | The learning trajectory — how far the approximation has moved toward the data distribution at each step | ~/.cache/aion-llm/runs/<run_id>/events.jsonl |
| Training corpora / custom datasets | Structured traces exported from higher layers — task histories, norm outcomes, reviews, and domain text to be distilled into the model. The current bridge emits `train`, `val`, and `replay` corpora. | JSONL files in the web-UI dataset store |
| LoRA / finetune adapters | Targeted compression of newly formed information without rewriting the whole base model. Today this is shipped as SFT/full-checkpoint distillation; true LoRA is the planned next backend. | Checkpoint artifacts produced by LoRA, SFT, or full retraining runs |
| usage.persistence (inference) | Per-token \(\mathcal{R}_t\): how well the current weights already encode the distilled approximation | JSON response field; not stored locally — forwarded to aion-core |
Delusion (\(\mathcal{D}_{KL}\)) when stale. The weights lag reality because they only see reality indirectly, through the training traces exported from aion-core and aion-blockchain. If those traces are stale, biased, or too narrow, the model compresses the wrong approximation. The fractal loss partially mitigates this during continual training by protecting already-mastered patterns from overwrite while still letting new validated information enter through LoRA, fine-tuning, or full retraining.
What flows in and out.
Upstream, every FractalLM chat completion carries usage.persistence
(R_mean, R_min, R_var) to aion-core, where it is attached to task messages and eventually
informs on-chain trust. Downstream, validated task histories, norm decisions, market
outcomes, and conflict patches can be exported back into training corpora that update the
model through adapters, fine-tuning, or full-model training.