LayerCake

Tokenizer-free byte-patch language models with a canonical knowledge ABI and portable sparse domain bricks.

LayerCake is investigating a different way to build and extend language models:

UTF-8 bytes
  -> causal byte patches
  -> compact global core
  -> deterministic canonical ABI
  -> portable top-k sparse domain brick
  -> shared canonical output contract
  -> byte predictions

The central hypothesis is that domain knowledge can live in a fixed ABI space rather than inside tokenizer-specific or d_model-specific weights. A brick should be trainable once, copied exactly, activated sparsely, quantized, and used by independently trained LayerCake cores of different sizes.
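As a minimal sketch of the first two pipeline stages, the snippet below encodes text as UTF-8 and groups the bytes into fixed-size patches. The function name `to_byte_patches` and the zero-padding policy are illustrative assumptions, not taken from the repository.

```python
def to_byte_patches(text: str, patch_size: int = 2, pad: int = 0) -> list[tuple[int, ...]]:
    """Encode text as UTF-8 and group the bytes into fixed-size patches."""
    if patch_size < 1:
        raise ValueError("patch_size must be positive")
    data = text.encode("utf-8")
    # Pad the tail so every patch holds exactly patch_size bytes (assumed policy).
    remainder = len(data) % patch_size
    if remainder:
        data += bytes([pad]) * (patch_size - remainder)
    return [tuple(data[i:i + patch_size]) for i in range(0, len(data), patch_size)]


print(to_byte_patches("héllo"))
```

Because patches are cut at fixed byte offsets, a multi-byte character such as "é" can straddle a patch boundary; the model never sees a tokenizer vocabulary, only bytes.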

The rolling-training branch adds a preview-guided control loop:

rubric -> non-destructive data/model preview -> syllabus -> staged training
       -> semantic gates -> model commit or rollback

This is the implementation of "show the model what it is about to train on." The preview artifact records byte entropy, fixed byte-patch compression, difficulty buckets, model BPB when available, ABI statistics when available, recommended trainable/frozen modules, curriculum mode, gates, and warnings before any destructive update runs.
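One of the preview statistics, byte entropy, can be sketched as the Shannon entropy of the empirical byte distribution. This is an assumption about what "byte entropy" means here; the repository may compute a different or conditional variant, and `byte_entropy_bits` is a hypothetical name.

```python
import math
from collections import Counter


def byte_entropy_bits(data: bytes) -> float:
    """Zeroth-order Shannon entropy of the byte histogram, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum -p * log2(p) over the bytes that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


sample = "rubric -> preview -> syllabus".encode("utf-8")
print(f"{byte_entropy_bits(sample):.3f} bits/byte")
```

This statistic ignores context, so it is only a rough difficulty signal and is not comparable to a model's BPB.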

The current smoke dominance harness is:

python scripts/benchmark_tier1_dominance.py --steps 4
python scripts/verify_tier1_dominance.py

It is a methodology gate, not a public scale-dominance claim.

Transformer-displacement claims are governed by dominance gates. Current locked evidence supports CPU/mobile-proxy wins for the 15M source/core and 6.8M receiver-after-transfer certificates. The local 276k/474k/735k/1.15M/2.7M/5.8M/8.8M/10.4M/12.8M/19.4M/25.6M probes now pass after adding an empirical byte-transition prior to the LayerCake path and expanding the equal-or-larger transformer matcher. These are local harness wins, not full-corpus scale-dominance claims. GPU generation remains a blocker.

This repository now contains both:

the original tokenized fixed-ABI LayerCake prototype; and
the v2 strictly causal tokenizer-free byte-patch research system.

Current measured result

The current north-star experiment is a fixed-budget comparison between a 14.79M-parameter two-byte LayerCake and a 14.84M-parameter 4,096-token BPE transformer. Both train on approximately 10.3M sampled bytes from the same local general-text stream. LayerCake uses four global and four window-local fused transformer blocks plus exact stateful cached byte generation.

Gate	LayerCake result	Comparator / threshold	Status
General held-out BPB, seed 6250	2.0446	BPE: 2.0492	PASS
General held-out BPB, seed 6263	2.0457	BPE: 2.0492	PASS
Parameters	14.792M	BPE: 14.844M	PASS
Mean training time, two seeds	121.4 s	BPE: 131.5 s	PASS
Batch-1 prefill latency	2.96 ms	BPE: 5.63 ms	PASS
Exact cached-generation BPB	1.9953 / 1.9836	BPE: 2.0492	PASS
One-thread CPU generation	2.91x / 2.96x BPE	ratio > 1	PASS
One-thread CPU no-repeat-8 generation	2.25x BPE	coherence gates pass	PASS
RTX 3080 Laptop generation	0.62x BPE	ratio > 1	FAIL

Cached generation is numerically equivalent to the trained full-forward path: the selected logit comparison differs by at most 1.9e-6 and has identical argmaxes. Local attention caches reset at the same 16-byte boundaries used during training.
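The equivalence check described above can be sketched as a comparison of two logit matrices, one from the full forward pass and one from the cached path. The helper below is hypothetical and uses plain Python lists; the repository's check presumably operates on tensors.

```python
def compare_logits(full: list[list[float]], cached: list[list[float]]) -> tuple[float, bool]:
    """Return (max absolute difference, True if every position has the same argmax)."""
    if len(full) != len(cached):
        raise ValueError("position counts differ")
    max_diff = 0.0
    same_argmax = True
    for row_full, row_cached in zip(full, cached):
        if len(row_full) != len(row_cached):
            raise ValueError("vocabulary sizes differ")
        # Largest element-wise deviation seen so far.
        max_diff = max(max_diff, max(abs(a - b) for a, b in zip(row_full, row_cached)))
        # Greedy decoding only changes if the argmax changes.
        if row_full.index(max(row_full)) != row_cached.index(max(row_cached)):
            same_argmax = False
    return max_diff, same_argmax
```

A certificate would then record both numbers. The 1.9e-6 figure quoted above is a measured maximum, not necessarily the threshold the verifier enforces.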

Unconstrained greedy generation is not good enough: the selected 15M model loops on phrases such as "state of the state". The current certificate therefore includes a separate no-repeat-8 cached-generation gate. With that decoding constraint, LayerCake keeps a 2.25x one-thread CPU speed ratio over the matched BPE transformer and passes the tracked printable, distinct-trigram, and repeated-8-gram gates. This is a coherence improvement, not a claim of human-level long-form generation.
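A no-repeat-n decoding constraint is commonly implemented by banning any next byte that would complete an n-gram already present in the generated history. The sketch below shows that common approach with hypothetical helper names; the repository's no-repeat-8 gate may be implemented differently.

```python
def banned_next_bytes(history: bytes, n: int) -> set[int]:
    """Bytes that would complete an n-gram already present in history."""
    if n < 1:
        raise ValueError("n must be positive")
    if len(history) < n - 1:
        return set()
    # The last n-1 bytes form the prefix of the n-gram about to be completed.
    prefix = history[len(history) - (n - 1):]
    banned = set()
    for i in range(len(history) - n + 1):
        if history[i:i + n - 1] == prefix:
            banned.add(history[i + n - 1])
    return banned


def constrained_greedy(logits: list[float], history: bytes, n: int) -> int:
    """Pick the highest-logit byte that does not repeat an existing n-gram."""
    banned = banned_next_bytes(history, n)
    candidates = [b for b in range(len(logits)) if b not in banned]
    if not candidates:
        raise ValueError("every byte is banned")
    return max(candidates, key=lambda b: logits[b])
```

With `n = 8`, a loop such as "state of the state" is interrupted as soon as the next byte would reproduce an eight-byte window that has already been emitted.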

This is a replicated local-corpus result, not evidence of universal tokenizer-free dominance. In this repository, mobile-capable means CPU-first unless a real device is named: non-GPU desktop CPUs and one-thread x86/ARM-style mobile proxies are required deployment gates. The current CPU result is not yet a phone, NPU, battery, or thermal measurement. GPU generation remains a separate accelerator optimization target.

Raw evidence: EXPERIMENT_RESULTS.md

Verify the combined core and migration certificate:

python scripts/verify_northstar_mobile.py

New transition-head 15M frontier

The empirical byte-transition head and narrowed local decoder now produce a stronger 15M-class source/core result while preserving exact receiver migration:

Gate	LayerCake transition result	Comparator / threshold	Status
Parameters	14.320M	BPE: 14.844M	PASS
General held-out BPB	2.0382	BPE: 2.0492	PASS
Training time, no profiling	122.5 s	BPE: 131.5 s	PASS
Training bytes	9.42M	BPE: 10.32M estimated	PASS
One-thread CPU no-repeat-4 generation	2.78x BPE	ratio > 1.10 plus diversity gates	PASS
Lossless transfer to 5.40M receiver	PPL ratio 1.0; max logit diff 0; identical generation	exact	PASS
Transferred-domain BPB	1.4406	adapter: 2.1101	PASS

Verify:

python scripts/verify_scale15m_transition_frontier.py
python scripts/verify_transformer_dominance_matrix.py
python scripts/verify_game_ready_mobile_llm.py
python scripts/benchmark_cpu_deployment_resources.py
python scripts/verify_cross_backend_quality_scorecard.py
python scripts/verify_many_domain_game_layers.py
python scripts/verify_game_domain_training_workflow.py
python scripts/verify_cross_domain_smoke_frontier.py
python scripts/verify_cross_domain_adapter_frontier.py
python scripts/verify_frontier_model_northstar.py

A 15.55M active-compute conv2 transition variant also beat the retained 20.61M BPE comparator on quality (2.0065 BPB versus 2.0154), but it trained in 134.9 seconds versus the comparator's 113.5 seconds. That is progress, not a 20M promotion.

Game-ready CPU/mobile proxy gate

The current game-deployment thesis is now tracked separately from broad scale dominance: a small CPU-first English core plus installable domain payloads for game-specific data.

Gate	Current evidence	Status
Core smaller than BPE	14.32M vs 14.84M params	PASS
General English BPB	2.0382 vs 2.0492 BPE	PASS
Training time	122.5 s vs 131.5 s BPE	PASS
One-thread CPU generation	2.78x BPE	PASS
Domain payload size	148,808 B vs 383,008 B adapter	PASS
Domain training time	51.3 s vs 183.1 s adapter	PASS
Domain CPU throughput	35.7K B/s vs 8.1K B/s adapter	PASS
Lossless domain transfer	PPL ratio 1.0; max logit diff 0; identical generation	PASS
Receiver after transfer	smaller, better BPB, faster training, faster CPU generation	PASS
Pruned CPU deployment artifact	0.96x BPE artifact size	PASS
Isolated CPU peak RSS	0.985x BPE peak RSS	PASS
Isolated CPU generation	2.13x BPE	PASS
Isolated CPU prefill microbench	0.86x BPE	OPEN

Verify:

python scripts/benchmark_cpu_deployment_resources.py
python scripts/verify_game_ready_mobile_llm.py

This is still a desktop CPU/mobile-proxy certificate. Real game shipping still requires Android/iOS or target-console latency measurements, battery and thermal measurements, a game-dialogue/domain dataset, task-level NPC/game QA evaluation, and a native int8 runtime. Local isolated CPU peak RSS is now measured with separate fresh Python processes and passes against the retained BPE comparator; the separate isolated prefill microbench remains open.

Cross-backend quality scorecard

LayerCake now tracks backend and quality dimensions separately so a CPU/mobile win cannot hide a GPU loss.

Dimension	Current result	Status
Training/quality/cost vs BPE	smaller, lower BPB, faster training, fewer bytes	PASS
CPU generation quality/speed	quality gates pass; 317.1 B/s vs 146.8 B/s	PASS
Batch-1 prefill latency	2.96 ms vs 5.63 ms BPE	PASS
Domain layers	smaller/faster/better than adapter; exact transfer	PASS
GPU generation quality	quality gates pass	PASS
GPU generation speed	244.2 B/s vs 840.2 B/s BPE	OPEN

Verify:

python scripts/verify_cross_backend_quality_scorecard.py

An across-the-board CPU+GPU dominance claim is blocked until GPU generation speed also beats the transformer comparator.

Frontier north-star gate

The master verifier aggregates the current promoted frontier evidence and explicitly keeps the larger north-star claim open until every remaining game/deployment gate exists.

python scripts/verify_frontier_model_northstar.py

Current promoted gates:

base 15M source/core frontier;
transformer dominance matrix promoted tiers;
cross-backend CPU/mobile-proxy scorecard;
game-ready CPU/mobile proxy;
receiver-after-transfer frontier;
many-domain install/migration/isolation mechanics.

Current open north-star items:

GPU generation speed;
20M full-corpus training-time dominance;
real mobile/device latency;
battery and thermal measurements;
isolated CPU prefill microbench;
native int8 runtime;
trained game-dialogue, lore, and quest-state payloads;
task-level NPC/game QA evaluation;
domain routing policy evaluation.

The many-domain proxy currently installs game_dialogue, game_lore, and game_quest_state payloads, verifies exact source/receiver migration for each, and checks that installing other domains does not change the selected domain's logits. It uses renamed copies of the current portable payload, so it proves install/migration/isolation mechanics, not game-domain quality.

The game-domain workflow smoke now trains a byte-GRU portable domain from tests/fixtures/game_dialogue_smoke.txt, quantizes it to int8, installs it into the 15M source and 5.40M receiver, and verifies exact migration. Current smoke metrics: 2.2185 BPB, 73.8% top-1 byte accuracy, PPL ratio 1.0, max logit diff 0.0, and identical generated bytes after transfer. This proves the train/quantize/install/migrate workflow for game-style text; it is not a production game-dialogue quality claim.

The cross-domain smoke extends that workflow to dialogue, lore, quest/state, and technical text. All four payloads train, quantize to int8, transfer exactly, and pass the smoke BPB/accuracy/printability gates. Current aggregate: mean BPB 2.2414, minimum top-1 byte accuracy 71.97%, max transfer logit diff 0.0. This is broader workflow evidence, not an all-corpora dominance claim.

The cross-domain adapter frontier compares those four portable payloads against matched BPE residual adapters trained on the same fixture files. LayerCake wins all four smoke domains on domain BPB, training seconds, payload size, and exact source/receiver transfer. Worst BPB margin is narrow on lore, -0.0019 BPB, so this is a smoke win that needs larger external corpora and multi-seed replication before any broad domain-dominance claim.

Strict same-PPL transfer

The original additive sparse brick does not preserve absolute PPL across independent cores:

Additive transfer	Source PPL	Target PPL	Ratio	Strict gate
Small cross-seed	56.82	98.91	1.74	FAIL
5.40M -> 2.19M	40.63	84.57	2.08	FAIL

The payload copies exactly, but different ABI states select different experts and the shared correction is added to different base logits.

LayerCake now has a separate core-independent lossless domain mode. A 148,736-parameter recurrent byte decoder consumes deterministic causal anchors and owns the domain prediction path instead of modifying target-core logits.

Lossless decoder transfer	PPL on both	Top-1 byte accuracy	Ratio
Small cross-seed, context 128	2.8553	72.60%	1.0000
5.40M -> 2.19M, context 256	2.7143	73.76%	1.0000
15.45M -> 5.40M, context 256	2.7143	73.76%	1.0000
int8 artifact, 15.45M -> 5.40M	2.7165	73.77%	1.0000

This proves exact domain-PPL portability for the explicit lossless mode. Additive mode uses the host model and is bounded but not exact; lossless mode is exact because its predictions do not depend on the host core.
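The transfer table reports PPL while the gate tables report BPB. Assuming the PPL figures are per-byte perplexities, the two are related by BPB = log2(PPL). Under that assumption, a PPL of 2.7143 corresponds to about 1.4406 BPB, which appears to agree with the transferred-domain BPB listed in the transition-frontier table. The helper names below are illustrative.

```python
import math


def ppl_to_bpb(byte_ppl: float) -> float:
    """Convert per-byte perplexity to bits per byte."""
    if byte_ppl <= 0:
        raise ValueError("perplexity must be positive")
    return math.log2(byte_ppl)


def bpb_to_ppl(bpb: float) -> float:
    """Inverse conversion: bits per byte back to per-byte perplexity."""
    return 2.0 ** bpb


print(f"{ppl_to_bpb(2.7143):.4f}")
```

A PPL ratio of exactly 1.0 between hosts therefore implies an identical BPB on both.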

The fp32 payload is 594,944 bytes. Symmetric per-tensor int8 storage is 148,808 bytes and increases PPL by 0.083%. The current loader dequantizes to fp32; this is compact artifact transport/storage evidence, not a native int8-kernel speed claim.
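The reported sizes are consistent with 4 bytes per parameter in fp32 (148,736 × 4 = 594,944) and 1 byte per parameter plus 72 bytes of overhead in int8; what that overhead holds is not stated here. The sketch below shows generic symmetric per-tensor int8 quantization followed by dequantization to floats, matching the description above in spirit. The function names and the clamp to ±127 are assumptions, not the repository's loader.

```python
def quantize_symmetric_int8(values: list[float]) -> tuple[list[int], float]:
    """Per-tensor symmetric quantization: one scale, zero point fixed at 0."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values, as a dequantizing loader would."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.25, 1.0]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize_int8(q, scale)
print(q, restored)
```

The round-trip error per weight is bounded by half the scale, which is why the quality cost can be small when the tensor has no extreme outliers.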

One-thread x86 CPU proxy results for 128-byte forward inference are 3.81 ms median and 33.6K bytes/s. This is not yet an Android, iOS, NPU, battery, or thermal benchmark.

Mobile deployment thesis

The evidence now supports this precise positioning:

Train a byte-level domain capsule once, verify its content hash, and install the same 149 KB int8 artifact on compatible LayerCake runtimes without retraining each host.

The tested artifact preserves its logits, PPL, byte accuracy, and deterministic output across 2.19M, 5.40M, and 15.45M LayerCake hosts. This is the mechanism LayerCake is developing for mobile and non-GPU desktop deployment: a smaller CPU-capable general core plus installable, domain-specific prediction payloads.
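Verifying an artifact's content hash before installation can be sketched as follows. SHA-256 is an assumption (the hash algorithm is not named here), and `sha256_file` and `verify_artifact` are hypothetical helpers.

```python
import hashlib
import hmac


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file incrementally so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_hex: str) -> bool:
    """Return True only if the file's digest matches the published one."""
    return hmac.compare_digest(sha256_file(path), expected_hex.lower())
```

An installer would call `verify_artifact` with the digest published alongside the payload and refuse to load the artifact when it returns `False`.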

It is not evidence that a mobile core has the same general intelligence as a larger core. PX, the exact core-independent portable-domain transfer level in the claim ladder, transfers the domain capsule's behavior exactly because that capsule owns the selected domain prediction path. Routing, task-level code quality, native CPU/mobile kernels, battery, thermal behavior, and real-device memory/latency remain separate gates. Local desktop CPU peak RSS is measured separately in the deployment-resource certificate.

Measured mobile domain-deployment win

LayerCake was compared with a matched 14.84M-parameter BPE transformer using a rank-16 residual adapter. Both systems adapted to the same local Python domain.

Metric	LayerCake PX	BPE transformer adapter	Winner
Domain BPB	1.4418	2.1101	LayerCake
Domain training time	51.3 s	183.1 s	LayerCake, 3.57x faster
Deployment artifact	148,808 B	383,008 B	LayerCake, 2.57x smaller
One-thread CPU throughput	31.9K B/s	7.1K B/s	LayerCake, 4.50x faster
RTX 3080 Laptop throughput	153.6K B/s	214.8K B/s	Transformer
Exact cross-host transfer	PASS	model-specific adapter	LayerCake

The transformer adapter has fewer trainable parameters (95,752 versus 148,736), but trains slower, produces a larger artifact, reaches worse domain BPB, and requires the full 14.84M-parameter tokenizer transformer at inference. With the adapter active, its general BPB changes from 2.041 to 2.420; it must be disabled outside the domain.

The domain-quality ordering replicated across two independent adaptation seeds: LayerCake BPB 1.4418/1.4436 versus adapter BPB 2.1101/2.0951. The unchanged payload was also reinstalled from the new 14.79M winning core into an independent 5.40M host: max logit difference 0, PPL ratio 1.0, and identical generated bytes. The matched BPE transformer no longer leads the selected general BPB or mobile CPU gates; it still wins the selected GPU generation benchmark.

Verify:

python scripts/verify_mobile_domain_win.py

Historical scaling checkpoints

The checkpoints below document the quality gap that existed before the current fused, window-local two-byte architecture. They remain useful negative controls but are no longer the repository frontier.

Naive scaling beyond the selected frontier has also been tested and rejected. A 23.69M 5+5-block LayerCake reached 2.0299 BPB in 214.1 seconds, and a 25.24M width-scaled 4+4 model reached 2.0376 BPB in 204.6 seconds. The matched 24.09M BPE transformer reached 2.0035 BPB in 158.0 seconds. The next scale step therefore requires better patch compression and fused training, not simply more dense blocks.

An intermediate 20M scale check narrows the boundary but does not clear it. A 20.25M width-448 LayerCake with 32-byte local windows and QK-normalized attention reached 2.0256 BPB in 165.1 seconds. Its matched 20.61M BPE transformer reached 2.0154 BPB in 113.5 seconds. Three follow-up changes were tried: same-byte batch-24 compression, 16-byte local windows, and shifting one block from local byte decoding into the global patch core. None of them improved the outcome on either gate. The best 20M LayerCake candidate remains 2.0256 BPB and 165.1 seconds; the fastest retained 20M candidate remains slower than BPE at 135.1 seconds and worse at 2.0356 BPB. The retained certificate is intentionally FAIL:

python scripts/verify_scale20m_frontier.py

A subsequent additive multi-scale experiment also failed its early rejection gate: four-byte coarse summaries combined with a two-byte fine stream reached 2.4216/2.4188 BPB at 750 steps versus 2.3180 for the fixed two-byte reference, with no training-speed gain. The implementation remains available as an experimental path, but the next full run will require content-dependent patch boundaries rather than additive fixed-scale summaries.

That content-dependent 2/4-byte follow-up was implemented and tested as well. It reached 2.514-2.524 BPB across 2.42-3.43 mean bytes per patch, versus 2.318 for the fixed two-byte probe. The vectorized path was fast, but changing patch positions damaged quality. It is therefore another documented negative control, not a selected architecture.

The strongest later scale candidate uses hardware-aligned width 512, 32-byte local windows, and QK-normalized attention. It reaches 2.0204 BPB at 25.77M parameters, but the matched 26.30M BPE transformer reaches 1.9940 BPB and trains faster. The result is recorded as progress, not a win. Exact portable-domain migration remains independent of this core quality result and continues to pass without PPL or generation changes.

The first pure-PyTorch sparse-state global patch core has also been tested at the 20M boundary. It preserves fixed two-byte ABI positions and cached generation support. The reduced-fan-in variant improved the LayerCake 20M quality frontier from 2.0256 to 2.0214 BPB while remaining smaller than the 20.61M BPE comparator, but it still lost to BPE quality and training time: BPE reached 2.0154 BPB in 113.5 seconds, while sparse-state LayerCake reached 2.0214 BPB in 248.1 seconds. The retained sparse-state certificate is therefore intentionally FAIL:

python scripts/verify_scale20m_sparse_state_frontier.py

The receiving-core comparison has also been rebuilt with the current fused architecture and a retained matched transformer artifact. The selected 6.804M receiver is still smaller than the 6.857M BPE transformer, trains faster, and beats it on general quality. The unchanged transferred domain remains exact and beats the transformer adapter.

Receiver-frontier gate	LayerCake receiver	Matched transformer	Status
Parameters	6.804M	6.857M	PASS
General BPB	2.1251	2.1265	PASS
Training time	77.05 s	81.45 s	PASS
One-thread CPU generation	1.47x BPE	ratio > 1	PASS
Transferred-domain BPB	1.4691	adapter: 2.1101	PASS
Transfer invariance	PPL ratio 1.0; max logit diff 0; identical generation	n/a	PASS

python scripts/verify_receiver_frontier.py

The next tier increases the patch core from 0.35M to 5.40M parameters and the ABI from 64 to 96 dimensions. It uses 20 MB of general text and 256-byte contexts.

Gate	5.40M result	Status
Patch vs byte parameters	5.40M vs 14.57M	PASS
Patch vs BPE parameters	5.40M vs 6.90M	PASS
Patch base inference	243.6K vs 122.1K bytes/s	PASS
Patch + brick inference	232.0K vs 122.1K bytes/s	PASS
Source Python PPL	157.03 -> 40.94	PASS
Source general ratio	1.0105	PASS
5.40M -> independent 2.19M transfer	domain ratio 0.533; general 1.021	PASS
Int8 transfer	domain ratio 0.532; general 1.021	PASS
General BPB vs matched BPE	2.261 vs 2.075	OPEN / BPE leads

This larger checkpoint preserves the architecture's size, speed, adaptation, transfer, and quantization advantages. It does not yet reproduce the small-scale BPB parity result. That negative result is part of the public evidence, not hidden.

The 15.45M patch checkpoint has completed 5,000 paired steps:

Gate	15.45M checkpoint
Parameters	15.45M vs 25.75M byte core
General BPB	2.430
25.75M byte baseline BPB	2.227
Patch inference	227.6K vs 93.0K bytes/s
Exact int8 portable-domain PPL	2.7165 on both 15.45M and 5.40M hosts
Filesystem-disjoint stdlib PPL	5.8296 on both hosts
Exact generated-byte identity	PASS

The patch model is 40.0% smaller and 2.45x faster in this CUDA benchmark. The byte model still leads quality by 0.203 BPB. This is a single-seed, 20 MB local-corpus result.

Verify it with:

python scripts/verify_scale5m_results.py

Verify the selected evidence

pip install -e .[dev]
pytest -q
python scripts/verify_research_gates.py
python scripts/verify_scale5m_results.py
python scripts/verify_scale15m_results.py
python scripts/verify_lossless_domain_results.py
python scripts/verify_mobile_domain_win.py
python scripts/verify_northstar_mobile.py
python scripts/eval_lossless_domain_decoder.py `
  --decoder runs_experiment/portable_python_gru148k_v1.pt `
  --source-core runs_experiment/scale5m_seed4242_continued.pt `
  --target-core runs_experiment/scale2m_seed5151.pt `
  --output results/lossless_domain_scale5m_to_2m.json

Expected:

all tests passed
"status": "PASS"

The verifier reads the committed result artifacts and checks every selected gate. It exits with a non-zero status if a required metric is missing or outside its threshold.
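The gate-checking behavior can be sketched as below. The metric names, the gate table, and the JSON layout are hypothetical; only the overall pattern (a missing or out-of-threshold metric produces a failure and a non-zero return code) follows the description above.

```python
import json

# Hypothetical gate table: metric name -> (comparison, threshold).
GATES = {
    "general_bpb": ("<", 2.0492),    # must beat the comparator's BPB
    "cpu_speed_ratio": (">", 1.0),   # must be faster than the comparator
}


def check_gates(results: dict, gates: dict) -> list[str]:
    """Return one message per missing or failing metric; empty means PASS."""
    failures = []
    for name, (op, threshold) in gates.items():
        if name not in results:
            failures.append(f"{name}: missing")
            continue
        value = results[name]
        ok = value < threshold if op == "<" else value > threshold
        if not ok:
            failures.append(f"{name}: {value} not {op} {threshold}")
    return failures


def verify_file(path: str, gates: dict = GATES) -> int:
    """Load a result artifact, print a status record, return a process exit code."""
    with open(path, encoding="utf-8") as handle:
        results = json.load(handle)
    failures = check_gates(results, gates)
    print(json.dumps({"status": "FAIL" if failures else "PASS", "failures": failures}))
    return 1 if failures else 0
```

A script entry point would pass the return value of `verify_file` to `sys.exit`, so that CI treats any failed or missing gate as an error.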

The original structural paste proof remains available:

python verify_paste.py

What changed about cross-seed generalization

The original LayerCake prototype copied brick weights exactly but failed functionally across independent cores. A chess brick could move bit-for-bit while target PPL exploded because:

each seed learned a different ABI coordinate system; and
each seed decoded ABI deltas through a different output projection.

V2 addresses both causes:

Deterministic causal anchors: every core aligns ABI states to the same byte-prefix target basis during training.
Canonical brick head: brick deltas use a fixed ABI-to-byte-logit contract shared across interfaces, seeds, and model widths.
Correct temporal alignment: the byte state after completed patch n aligns with the context used to predict patch n+1.
General-preservation loss: brick training is constrained against the frozen base distribution and must pass an external non-regression gate.

Cross-seed failure without alignment remains an important negative control. It is no longer an unresolved explanation for the selected v2 architecture.

Why byte patches

Byte models avoid vocabulary lock-in but make global attention expensive. LayerCake uses:

a local causal decoder at byte resolution;
a smaller global transformer over patch states;
continuous local hidden state across patch boundaries;
a canonical ABI above the patch perception layer.

In the selected experiment, global sequence length is reduced 4x. The resulting patch core is smaller and faster than both the byte transformer and the trained BPE baseline while matching BPE general BPB by point estimate.
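To illustrate why a 4x shorter global sequence matters, the sketch below counts causal attention pairs, which grow quadratically with sequence length. The 256-byte context is an illustrative choice, and the count covers only the global transformer, not the local byte decoder.

```python
def global_attention_cost(num_bytes: int, patch_size: int) -> tuple[int, int]:
    """Return (global sequence length, causal attention pairs) for fixed patches."""
    if patch_size < 1:
        raise ValueError("patch_size must be positive")
    seq_len = -(-num_bytes // patch_size)   # ceiling division
    # Each position attends to itself and every earlier position.
    pairs = seq_len * (seq_len + 1) // 2
    return seq_len, pairs


byte_level = global_attention_cost(256, 1)
patched = global_attention_cost(256, 4)
print(byte_level, patched)
```

For a 256-byte context this gives 32,896 pairs at byte resolution versus 2,080 pairs with four-byte patches, roughly a 16x reduction in global attention work.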

The current implementation uses fixed patches. Learned entropy/difficulty boundaries remain a scale-up target.

Portable sparse domain bricks

The selected brick has:

8 installed low-rank experts;
rank 16 per expert;
top-2 active experts;
16,897 parameters;
a residual no-op initialization;
exact state-dict portability;
optional int8 fake-quantized transfer.

Installed knowledge does not require evaluating every expert. The router scores installed experts, but only selected expert matrices execute.

This is different from:

Method	Key distinction
LoRA	LoRA matrices are shaped by each target layer and `d_model`; LayerCake bricks bind to `d_abi`.
Adapter	Ordinary adapters remain model-specific; LayerCake uses a versioned canonical coordinate/output contract.
MoE	MoE experts usually belong to one core; LayerCake bricks are portable artifacts.
RAG	RAG retrieves external context; bricks modify model behavior in ABI space.
Fine-tuning	Full tuning changes the core; brick training freezes it.
BLT-style models	BLT targets tokenizer-free dynamic byte compute; LayerCake adds portable ABI-space knowledge.
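The top-k behavior described for the selected brick (the router scores every installed expert, but only the selected expert matrices execute) can be sketched in plain Python. The class `SparseBrick`, its initialization, and the softmax weighting over the chosen experts are assumptions for illustration; the repository's operators live in domain_bricks.py and may differ.

```python
import math
import random


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseBrick:
    """Low-rank experts in ABI space with top-k routing (illustrative only)."""

    def __init__(self, d_abi: int, rank: int, experts: int, top_k: int, seed: int = 0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = [[rng.gauss(0, 0.1) for _ in range(d_abi)] for _ in range(experts)]
        self.down = [[[rng.gauss(0, 0.1) for _ in range(d_abi)] for _ in range(rank)]
                     for _ in range(experts)]
        # Zero up-projections make the brick a residual no-op at initialization.
        self.up = [[[0.0] * rank for _ in range(d_abi)] for _ in range(experts)]
        self.executed: list[int] = []

    def forward(self, abi_state: list[float]) -> list[float]:
        scores = matvec(self.router, abi_state)          # cheap: one score per expert
        chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:self.top_k]
        peak = max(scores[i] for i in chosen)
        weights = [math.exp(scores[i] - peak) for i in chosen]
        total = sum(weights)
        out = list(abi_state)
        for weight, i in zip(weights, chosen):           # only chosen experts run
            delta = matvec(self.up[i], matvec(self.down[i], abi_state))
            out = [o + (weight / total) * d for o, d in zip(out, delta)]
        self.executed = chosen
        return out


brick = SparseBrick(d_abi=8, rank=2, experts=4, top_k=2)
state = [0.5] * 8
print(brick.forward(state) == state, brick.executed)
```

Because the operator's shapes depend only on the ABI width and rank, the same weights can in principle be applied to any core that exposes an ABI state of that width.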

Claim ladder

Level	Meaning	Current evidence
L0	Exact weight copy	Proven
L1	Equal ABI input, equal brick function	Proven
L2	Same-core generation identity	Proven on legacy tokenized path
L3	Cross-size structural/function portability	Proven; bounded end-to-end v2 transfer passes locally
L4	Bounded additive cross-seed semantic transfer	Small-scale PASS
L5	Bounded quantized transfer	Small-scale int8 PASS
L6	Bounded tokenizer-independent byte/patch transfer	Small-scale PASS
PX	Exact core-independent portable-domain transfer	PASS through 15.45M tier
L7	Orchestrated swarm transfer	Interface implemented; task-level evidence pending

See RUBRIC.md for exact definitions.

Reproduce or extend the experiments

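Core paired training (the multi-line commands in this README use PowerShell backtick line continuations; in bash, replace each trailing backtick with a backslash):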

python scripts/run_paired_byte_experiment.py `
  --seed 2028 `
  --d-model 128 --layers 3 --heads 4 `
  --patch-size 4 --continuous-local `
  --patch-d-model 96 --patch-layers 2 --patch-heads 4 `
  --d-byte 32 --d-abi 64 `
  --steps 4000 --brick-steps 1000 `
  --artifact runs_experiment/my_core.pt `
  --output results/my_core.json

Train a sparse portable brick:

python scripts/train_sparse_brick_artifact.py `
  --core runs_experiment/my_core.pt `
  --steps 6000 --rank 16 --experts 8 --top-k 2 `
  --preserve-weight 2 `
  --artifact runs_experiment/my_brick.pt `
  --output results/my_brick.json

Transfer and quantize it:

python scripts/eval_portable_brick.py `
  --brick runs_experiment/my_brick.pt `
  --target runs_experiment/another_core.pt `
  --quantize-int8 `
  --output results/my_transfer.json

Benchmark inference:

python scripts/benchmark_canonical_artifact.py `
  --core runs_experiment/my_core.pt `
  --brick runs_experiment/my_brick.pt `
  --iterations 300 --rounds 9 `
  --output results/my_inference.json

Repository map

layercake/
  abi.py                  versioned compatibility contract
  abi_alignment.py        anchor, whitening, and alignment losses
  canonical_anchors.py    deterministic seed-independent prefix basis
  causal_byte_models.py   strictly causal byte and byte-patch models
  byte_patch.py           codecs, patchers, and metadata
  domain_bricks.py        low-rank and top-k sparse portable operators
  portable_domain.py      exact core-independent domain prediction payload
  rolling/                rollbackable model-commit training substrate
  orchestration.py        CorticalSwarm-style handoff packet and router
  transfer.py             copy, PPL, and degradation contracts

scripts/
  run_paired_byte_experiment.py
  train_sparse_brick_artifact.py
  eval_portable_brick.py
  eval_lossless_domain_decoder.py
  benchmark_bpe_baseline.py
  benchmark_canonical_artifact.py
  demo_rolling_training.py
  benchmark_rolling_training.py
  benchmark_rollback_cost.py
  benchmark_cherrypick_transfer.py
  verify_research_gates.py

results/
  research_gate_certificate.json
  certificates/rolling_demo_certificate.json
  selected raw benchmark and transfer artifacts

The original flat model.py, training scripts, and paste proof remain intact for legacy reproduction.

What is not yet established

Results have not yet been replicated at 25M, 60M, 150M, or 1B scale.
The current parity result uses one selected byte-patch seed and one BPE seed.
Confidence intervals, energy-to-quality, and matched wall-clock scaling curves remain to be measured.
Dynamic learned patching is not implemented.
Native int8 kernels were not benchmarked; current int8 evidence uses quantize/dequantize.
L7 orchestration has serialization and routing tests but no end-to-end task benchmark.
Production serving, security hardening, and distributed training are not complete.

The next milestone is a 25M-class, three-seed, matched-byte experiment with frozen hashes and the same transfer matrix.

Documentation

Citation

@software{layercake2026,
  author  = {Yoder, Sam},
  title   = {LayerCake: Tokenizer-Free Byte-Patch Models with a Canonical Knowledge ABI},
  year    = {2026},
  url     = {https://github.com/Yoder23/layercake},
  license = {Apache-2.0}
}

License

Apache 2.0. See LICENSE.
