LayerCake

Tokenizer-free byte-patch language models with a canonical knowledge ABI and portable sparse domain bricks.

Python 3.10+ | License: Apache 2.0 | Status: Research Preview

LayerCake is investigating a different way to build and extend language models:

UTF-8 bytes
  -> causal byte patches
  -> compact global core
  -> deterministic canonical ABI
  -> portable top-k sparse domain brick
  -> shared canonical output contract
  -> byte predictions

The central hypothesis is that domain knowledge can live in a fixed ABI space rather than inside tokenizer-specific or d_model-specific weights. A brick should be trainable once, copied exactly, activated sparsely, quantized, and used by independently trained LayerCake cores of different sizes.
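As a rough illustration of the first two stages of the pipeline above (UTF-8 bytes, then fixed byte patches), the sketch below splits text into fixed-size patches. The function name `to_byte_patches`, the zero padding byte, and the right-padding scheme are assumptions made for this example and are not taken from `layercake/byte_patch.py`.

```python
def to_byte_patches(text: str, patch_size: int = 2, pad: int = 0) -> list[list[int]]:
    """Encode text as UTF-8 and split it into fixed-size byte patches."""
    data = list(text.encode("utf-8"))
    # Right-pad so the final patch is complete (padding scheme is an assumption).
    remainder = len(data) % patch_size
    if remainder:
        data += [pad] * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


# "h\u00e9llo" is 5 characters but 6 UTF-8 bytes, so it yields three 2-byte patches.
patches = to_byte_patches("h\u00e9llo", patch_size=2)
```

No vocabulary is involved at any point: the only inputs are raw byte values in the range 0-255.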

The rolling-training branch adds a preview-guided control loop:

rubric -> non-destructive data/model preview -> syllabus -> staged training
       -> semantic gates -> model commit or rollback

This is the implementation of "show the model what it is about to train on." The preview artifact records byte entropy, fixed byte-patch compression, difficulty buckets, model BPB when available, ABI statistics when available, recommended trainable/frozen modules, curriculum mode, gates, and warnings before any destructive update runs.
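One of the preview statistics named above, byte entropy, can be computed without touching the model. The following is a minimal sketch of an order-0 Shannon entropy over bytes; the function name is hypothetical and the repository's preview code may compute a different variant.

```python
import math
from collections import Counter


def byte_entropy_bits(data: bytes) -> float:
    """Order-0 Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = byte_entropy_bits(bytes(range(256)))  # every byte value once
constant = byte_entropy_bits(b"aaaa")           # a single repeated byte
```

A value near 8 bits per byte suggests data that is close to incompressible at the byte level, while a value near 0 suggests highly repetitive data.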

The current smoke dominance harness is:

python scripts/benchmark_tier1_dominance.py --steps 4
python scripts/verify_tier1_dominance.py

It is a methodology gate, not a public scale-dominance claim.

Transformer-displacement claims are governed by dominance gates. Current locked evidence supports CPU/mobile-proxy wins for the 15M source/core and 6.8M receiver-after-transfer certificates. The local 276k/474k/735k/1.15M/2.7M/5.8M/8.8M/10.4M/12.8M/19.4M/25.6M probes now pass after adding an empirical byte-transition prior to the LayerCake path and expanding the equal-or-larger transformer matcher. These are local harness wins, not full-corpus scale-dominance claims. GPU generation remains a blocker.

This repository now contains both:

  • the original tokenized fixed-ABI LayerCake prototype; and
  • the v2 strictly causal tokenizer-free byte-patch research system.

Current measured result

The current north-star experiment is a fixed-budget comparison between a 14.79M-parameter two-byte LayerCake and a 14.84M-parameter 4,096-token BPE transformer. Both train on approximately 10.3M sampled bytes from the same local general-text stream. LayerCake uses four global and four window-local fused transformer blocks plus exact stateful cached byte generation.

| Gate | LayerCake result | Comparator / threshold | Status |
| --- | --- | --- | --- |
| General held-out BPB, seed 6250 | 2.0446 | BPE: 2.0492 | PASS |
| General held-out BPB, seed 6263 | 2.0457 | BPE: 2.0492 | PASS |
| Parameters | 14.792M | BPE: 14.844M | PASS |
| Mean training time, two seeds | 121.4 s | BPE: 131.5 s | PASS |
| Batch-1 prefill latency | 2.96 ms | BPE: 5.63 ms | PASS |
| Exact cached-generation BPB | 1.9953 / 1.9836 | BPE: 2.0492 | PASS |
| One-thread CPU generation | 2.91x / 2.96x BPE | ratio > 1 | PASS |
| One-thread CPU no-repeat-8 generation | 2.25x BPE | coherence gates pass | PASS |
| RTX 3080 Laptop generation | 0.62x BPE | ratio > 1 | FAIL |

Cached generation is numerically equivalent to the trained full-forward path: the selected logit comparison differs by at most 1.9e-6 and has identical argmaxes. Local attention caches reset at the same 16-byte boundaries used during training.
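The equivalence check described above compares two sets of logits position by position. A minimal sketch of such a comparison, using plain Python lists in place of tensors, might look like the following. The function name and the sample values are illustrative only; the 1.9e-6 tolerance is the figure quoted above.

```python
def compare_logits(full, cached):
    """Return (max absolute difference, whether every row has the same argmax)."""
    assert len(full) == len(cached), "both paths must score the same positions"
    max_diff = 0.0
    same_argmax = True
    for f_row, c_row in zip(full, cached):
        max_diff = max(max_diff, max(abs(f - c) for f, c in zip(f_row, c_row)))
        if f_row.index(max(f_row)) != c_row.index(max(c_row)):
            same_argmax = False
    return max_diff, same_argmax


# Made-up logits for two positions; the cached path differs by about 1e-6.
full_path = [[0.10, 2.50, -1.00], [1.20, 0.30, 0.90]]
cached_path = [[0.10, 2.500001, -1.00], [1.20, 0.30, 0.900001]]
max_diff, same_argmax = compare_logits(full_path, cached_path)
passes = max_diff <= 1.9e-6 and same_argmax
```

Checking the argmax separately matters because a small numerical difference can still flip a greedy decision when two logits are nearly tied.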

Unconstrained greedy generation is not good enough: the selected 15M model loops on phrases such as "state of the state". The current certificate therefore includes a separate no-repeat-8 cached-generation gate. With that decoding constraint, LayerCake keeps a 2.25x one-thread CPU speed ratio over the matched BPE transformer and passes the tracked printable, distinct-trigram, and repeated-8-gram gates. This is a coherence improvement, not a claim of human-level long-form generation.
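A no-repeat-n decoding constraint of the kind described above can be sketched as follows: before each greedy step, ban every byte that would complete an n-gram already present in the output. This is a generic formulation of no-repeat-n-gram decoding, not the repository's implementation, and both function names are hypothetical.

```python
def banned_next_bytes(generated: bytes, n: int = 8) -> set[int]:
    """Bytes that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n - 1:
        return set()
    prefix = generated[len(generated) - (n - 1):]  # last n-1 bytes
    banned = set()
    for i in range(len(generated) - n + 1):
        if generated[i:i + n - 1] == prefix:
            banned.add(generated[i + n - 1])
    return banned


def constrained_greedy(logits: list[float], generated: bytes, n: int = 8) -> int:
    """Pick the highest-scoring byte that does not repeat an n-gram."""
    banned = banned_next_bytes(generated, n)
    allowed = [b for b in range(len(logits)) if b not in banned]
    if not allowed:  # degenerate case: everything banned, fall back to plain greedy
        allowed = list(range(len(logits)))
    return max(allowed, key=lambda b: logits[b])


# Toy example with n=3: after "abcab", emitting "c" would repeat the trigram "abc".
logits = [0.0] * 256
logits[ord("c")] = 5.0
logits[ord("x")] = 4.0
choice = constrained_greedy(logits, b"abcab", n=3)
```

The constraint only removes candidates; it does not change the model's scores, which is why it is tracked as a separate generation gate rather than a model-quality result.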

This is a replicated local-corpus result, not evidence of universal tokenizer-free dominance. In this repository, mobile-capable means CPU-first unless a real device is named: non-GPU desktop CPUs and one-thread x86/ARM-style mobile proxies are required deployment gates. The current CPU result is not yet a phone, NPU, battery, or thermal measurement. GPU generation remains a separate accelerator optimization target.

Raw evidence: EXPERIMENT_RESULTS.md

Verify the combined core and migration certificate:

python scripts/verify_northstar_mobile.py

New transition-head 15M frontier

The empirical byte-transition head and narrowed local decoder now produce a stronger 15M-class source/core result while preserving exact receiver migration:

| Gate | LayerCake transition result | Comparator / threshold | Status |
| --- | --- | --- | --- |
| Parameters | 14.320M | BPE: 14.844M | PASS |
| General held-out BPB | 2.0382 | BPE: 2.0492 | PASS |
| Training time, no profiling | 122.5 s | BPE: 131.5 s | PASS |
| Training bytes | 9.42M | BPE: 10.32M estimated | PASS |
| One-thread CPU no-repeat-4 generation | 2.78x BPE | ratio > 1.10 plus diversity gates | PASS |
| Lossless transfer to 5.40M receiver | PPL ratio 1.0; max logit diff 0; identical generation | exact | PASS |
| Transferred-domain BPB | 1.4406 | adapter: 2.1101 | PASS |

Verify:

python scripts/verify_scale15m_transition_frontier.py
python scripts/verify_transformer_dominance_matrix.py
python scripts/verify_game_ready_mobile_llm.py
python scripts/benchmark_cpu_deployment_resources.py
python scripts/verify_cross_backend_quality_scorecard.py
python scripts/verify_many_domain_game_layers.py
python scripts/verify_game_domain_training_workflow.py
python scripts/verify_cross_domain_smoke_frontier.py
python scripts/verify_cross_domain_adapter_frontier.py
python scripts/verify_frontier_model_northstar.py

A 15.55M active-compute conv2 transition variant also beat the retained 20.61M BPE comparator on quality (2.0065 BPB versus 2.0154), but it trained in 134.9 seconds versus the comparator's 113.5 seconds. That is progress, not a 20M promotion.

Game-ready CPU/mobile proxy gate

The current game-deployment thesis is now tracked separately from broad scale dominance: a small CPU-first English core plus installable domain payloads for game-specific data.

| Gate | Current evidence | Status |
| --- | --- | --- |
| Core smaller than BPE | 14.32M vs 14.84M params | PASS |
| General English BPB | 2.0382 vs 2.0492 BPE | PASS |
| Training time | 122.5 s vs 131.5 s BPE | PASS |
| One-thread CPU generation | 2.78x BPE | PASS |
| Domain payload size | 148,808 B vs 383,008 B adapter | PASS |
| Domain training time | 51.3 s vs 183.1 s adapter | PASS |
| Domain CPU throughput | 35.7K B/s vs 8.1K B/s adapter | PASS |
| Lossless domain transfer | PPL ratio 1.0; max logit diff 0; identical generation | PASS |
| Receiver after transfer | smaller, better BPB, faster training, faster CPU generation | PASS |
| Pruned CPU deployment artifact | 0.96x BPE artifact size | PASS |
| Isolated CPU peak RSS | 0.985x BPE peak RSS | PASS |
| Isolated CPU generation | 2.13x BPE | PASS |
| Isolated CPU prefill microbench | 0.86x BPE | OPEN |

Verify:

python scripts/benchmark_cpu_deployment_resources.py
python scripts/verify_game_ready_mobile_llm.py

This is still a desktop CPU/mobile-proxy certificate. Shipping in a real game still requires latency measurements on Android/iOS or the target console, battery and thermal measurements, a game-dialogue/domain dataset, task-level NPC/game QA evaluation, and a native int8 runtime. Local isolated CPU peak RSS is now measured in separate fresh Python processes and passes against the retained BPE comparator; the separate isolated prefill microbench remains open.

Cross-backend quality scorecard

LayerCake now tracks backend and quality dimensions separately so a CPU/mobile win cannot hide a GPU loss.

| Dimension | Current result | Status |
| --- | --- | --- |
| Training/quality/cost vs BPE | smaller, lower BPB, faster training, fewer bytes | PASS |
| CPU generation quality/speed | quality gates pass; 317.1 B/s vs 146.8 B/s | PASS |
| Batch-1 prefill latency | 2.96 ms vs 5.63 ms BPE | PASS |
| Domain layers | smaller/faster/better than adapter; exact transfer | PASS |
| GPU generation quality | quality gates pass | PASS |
| GPU generation speed | 244.2 B/s vs 840.2 B/s BPE | OPEN |

Verify:

python scripts/verify_cross_backend_quality_scorecard.py

An across-the-board CPU+GPU dominance claim is blocked until GPU generation speed also beats the transformer comparator.

Frontier north-star gate

The master verifier aggregates the current promoted frontier evidence and explicitly keeps the larger north-star claim open until every remaining game/deployment gate exists.

python scripts/verify_frontier_model_northstar.py

Current promoted gates:

  • base 15M source/core frontier;
  • transformer dominance matrix promoted tiers;
  • cross-backend CPU/mobile-proxy scorecard;
  • game-ready CPU/mobile proxy;
  • receiver-after-transfer frontier;
  • many-domain install/migration/isolation mechanics.

Current open north-star items:

  • GPU generation speed;
  • 20M full-corpus training-time dominance;
  • real mobile/device latency;
  • battery and thermal measurements;
  • isolated CPU prefill microbench;
  • native int8 runtime;
  • trained game-dialogue, lore, and quest-state payloads;
  • task-level NPC/game QA evaluation;
  • domain routing policy evaluation.

The many-domain proxy currently installs game_dialogue, game_lore, and game_quest_state payloads, verifies exact source/receiver migration for each, and checks that installing other domains does not change the selected domain's logits. It uses renamed copies of the current portable payload, so it proves install/migration/isolation mechanics, not game-domain quality.

The game-domain workflow smoke now trains a byte-GRU portable domain from tests/fixtures/game_dialogue_smoke.txt, quantizes it to int8, installs it into the 15M source and 5.40M receiver, and verifies exact migration. Current smoke metrics: 2.2185 BPB, 73.8% top-1 byte accuracy, PPL ratio 1.0, max logit diff 0.0, and identical generated bytes after transfer. This proves the train/quantize/install/migrate workflow for game-style text; it is not a production game-dialogue quality claim.

The cross-domain smoke extends that workflow to dialogue, lore, quest/state, and technical text. All four payloads train, quantize to int8, transfer exactly, and pass the smoke BPB/accuracy/printability gates. Current aggregate: mean BPB 2.2414, minimum top-1 byte accuracy 71.97%, max transfer logit diff 0.0. This is broader workflow evidence, not an all-corpora dominance claim.

The cross-domain adapter frontier compares those four portable payloads against matched BPE residual adapters trained on the same fixture files. LayerCake wins all four smoke domains on domain BPB, training seconds, payload size, and exact source/receiver transfer. Worst BPB margin is narrow on lore, -0.0019 BPB, so this is a smoke win that needs larger external corpora and multi-seed replication before any broad domain-dominance claim.

Strict same-PPL transfer

The original additive sparse brick does not preserve absolute PPL across independent cores:

| Additive transfer | Source PPL | Target PPL | Ratio | Strict gate |
| --- | --- | --- | --- | --- |
| Small cross-seed | 56.82 | 98.91 | 1.74 | FAIL |
| 5.40M -> 2.19M | 40.63 | 84.57 | 2.08 | FAIL |

The payload copies exactly, but different ABI states select different experts and the shared correction is added to different base logits.

LayerCake now has a separate core-independent lossless domain mode. A 148,736-parameter recurrent byte decoder consumes deterministic causal anchors and owns the domain prediction path instead of modifying target-core logits.

| Lossless decoder transfer | PPL on both | Top-1 byte accuracy | Logit max diff | Ratio |
| --- | --- | --- | --- | --- |
| Small cross-seed, context 128 | 2.8553 | 72.60% | 0.0 | 1.0000 |
| 5.40M -> 2.19M, context 256 | 2.7143 | 73.76% | 0.0 | 1.0000 |
| 15.45M -> 5.40M, context 256 | 2.7143 | 73.76% | 0.0 | 1.0000 |
| int8 artifact, 15.45M -> 5.40M | 2.7165 | 73.77% | 0.0 | 1.0000 |

This proves exact domain-PPL portability for the explicit lossless mode. Additive mode uses the host model and is bounded but not exact; lossless mode is exact because its predictions do not depend on the host core.
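The PPL values in the lossless-transfer table and the domain BPB values reported elsewhere in this README appear to be two views of the same quantity. Assuming the PPL is measured per byte, bits per byte is simply the base-2 logarithm of perplexity. The sketch below shows the conversion; the per-byte interpretation is an assumption inferred from the numbers, not something the source states explicitly.

```python
import math


def bits_per_byte(byte_ppl: float) -> float:
    """Convert per-byte perplexity to bits per byte (BPB)."""
    return math.log2(byte_ppl)


def byte_perplexity(bpb: float) -> float:
    """Convert bits per byte back to per-byte perplexity."""
    return 2.0 ** bpb


fp32_bpb = bits_per_byte(2.7143)  # close to the reported transferred-domain BPB of 1.4406
int8_bpb = bits_per_byte(2.7165)  # close to the reported LayerCake PX domain BPB of 1.4418
```

Under this reading, a PPL ratio of exactly 1.0 after transfer is equivalent to an unchanged BPB.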

The fp32 payload is 594,944 bytes. Symmetric per-tensor int8 storage is 148,808 bytes and increases PPL by 0.083%. The current loader dequantizes to fp32; this is compact artifact transport/storage evidence, not a native int8-kernel speed claim.
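The fp32 size follows directly from the parameter count: 148,736 parameters at 4 bytes each is 594,944 bytes. At 1 byte per parameter the weights alone would take 148,736 bytes, 72 bytes less than the reported int8 artifact; the source does not say what the remaining bytes hold. The sketch below illustrates generic symmetric per-tensor int8 quantization with a dequantizing loader. It is not the repository's quantizer, and the function names are hypothetical.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: one scale, zero-point fixed at 0."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate fp32 values, as a dequantizing loader would."""
    return [x * scale for x in q]


weights = [0.5, -1.0, 0.25, 0.0, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = 148_736 * 4  # parameter count from the text, 4 bytes per fp32 value
```

Because each value is rounded to the nearest step, the reconstruction error per weight is bounded by half a quantization step, which is consistent with a small rather than zero PPL change.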

One-thread x86 CPU proxy results for 128-byte forward inference are 3.81 ms median and 33.6K bytes/s. This is not yet an Android, iOS, NPU, battery, or thermal benchmark.
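The two figures above are consistent with each other: throughput is the number of bytes processed divided by the latency. A small worked check, with a hypothetical helper name:

```python
def bytes_per_second(n_bytes: int, latency_ms: float) -> float:
    """Throughput implied by processing n_bytes in latency_ms milliseconds."""
    return n_bytes / (latency_ms / 1000.0)


throughput = bytes_per_second(128, 3.81)  # about 33.6K bytes/s
```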

Mobile deployment thesis

The evidence now supports this precise positioning:

Train a byte-level domain capsule once, verify its content hash, and install the same 149 KB int8 artifact on compatible LayerCake runtimes without retraining each host.
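The "verify its content hash" step could be as simple as the sketch below. The choice of SHA-256, the function names, and the stand-in file are all assumptions for illustration; the source does not specify which hash the repository uses or where the expected digest is stored.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected_hex: str) -> bool:
    """True only if the file's digest matches the expected one."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Demo with a stand-in payload rather than a real domain capsule.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "capsule.bin")
    with open(path, "wb") as fh:
        fh.write(b"stand-in payload bytes")
    expected = hashlib.sha256(b"stand-in payload bytes").hexdigest()
    ok = verify_artifact(path, expected)
    tampered = verify_artifact(path, "0" * 64)
```

Verifying the digest before installation is what makes "the same artifact" a checkable statement across hosts.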

The tested artifact preserves its logits, PPL, byte accuracy, and deterministic output across 2.19M, 5.40M, and 15.45M LayerCake hosts. This is the mechanism LayerCake is developing for mobile and non-GPU desktop deployment: a smaller CPU-capable general core plus installable, domain-specific prediction payloads.

It is not evidence that a mobile core has the same general intelligence as a larger core. PX, the exact core-independent transfer level in the claim ladder, transfers the domain capsule's behavior exactly because that capsule owns the selected domain prediction path. Routing, task-level code quality, native CPU/mobile kernels, battery, thermal behavior, and real-device memory/latency remain separate gates. Local desktop CPU peak RSS is measured separately in the deployment-resource certificate.

Measured mobile domain-deployment win

LayerCake was compared with a matched 14.84M-parameter BPE transformer using a rank-16 residual adapter. Both systems adapted to the same local Python domain.

| Metric | LayerCake PX | BPE transformer adapter | Winner |
| --- | --- | --- | --- |
| Domain BPB | 1.4418 | 2.1101 | LayerCake |
| Domain training time | 51.3 s | 183.1 s | LayerCake, 3.57x faster |
| Deployment artifact | 148,808 B | 383,008 B | LayerCake, 2.57x smaller |
| One-thread CPU throughput | 31.9K B/s | 7.1K B/s | LayerCake, 4.50x faster |
| RTX 3080 Laptop throughput | 153.6K B/s | 214.8K B/s | Transformer |
| Exact cross-host transfer | PASS | model-specific adapter | LayerCake |

The transformer adapter has fewer trainable parameters (95,752 versus 148,736), but trains slower, produces a larger artifact, reaches worse domain BPB, and requires the full 14.84M-parameter tokenizer transformer at inference. With the adapter active, its general BPB changes from 2.041 to 2.420; it must be disabled outside the domain.

The domain-quality ordering replicated across two independent adaptation seeds: LayerCake BPB 1.4418/1.4436 versus adapter BPB 2.1101/2.0951. The unchanged payload was also reinstalled from the new 14.79M winning core into an independent 5.40M host: max logit difference 0, PPL ratio 1.0, and identical generated bytes. The matched BPE transformer no longer leads the selected general BPB or mobile CPU gates; it still wins the selected GPU generation benchmark.

Verify:

python scripts/verify_mobile_domain_win.py

Historical scaling checkpoints

The checkpoints below document the quality gap that existed before the current fused, window-local two-byte architecture. They remain useful negative controls but are no longer the repository frontier.

Naive scaling beyond the selected frontier has also been tested and rejected. A 23.69M 5+5-block LayerCake reached 2.0299 BPB in 214.1 seconds, and a 25.24M width-scaled 4+4 model reached 2.0376 BPB in 204.6 seconds. The matched 24.09M BPE transformer reached 2.0035 BPB in 158.0 seconds. The next scale step therefore requires better patch compression and fused training, not simply more dense blocks.

An intermediate 20M scale check narrows the boundary but does not clear it. A 20.25M width-448 LayerCake with 32-byte local windows and QK-normalized attention reached 2.0256 BPB in 165.1 seconds. Its matched 20.61M BPE transformer reached 2.0154 BPB in 113.5 seconds. Three further changes (same-byte batch-24 compression, 16-byte local windows, and shifting one block from local byte decoding into the global patch core) improved neither gate. The best 20M LayerCake candidate remains 2.0256 BPB and 165.1 seconds; the fastest retained 20M candidate remains slower than BPE at 135.1 seconds and worse at 2.0356 BPB. The retained certificate is intentionally FAIL:

python scripts/verify_scale20m_frontier.py

A subsequent additive multi-scale experiment also failed its early rejection gate: four-byte coarse summaries combined with a two-byte fine stream reached 2.4216/2.4188 BPB at 750 steps versus 2.3180 for the fixed two-byte reference, with no training-speed gain. The implementation remains available as an experimental path, but the next full run will require content-dependent patch boundaries rather than additive fixed-scale summaries.

That content-dependent 2/4-byte follow-up was implemented and tested as well. It reached 2.514-2.524 BPB across 2.42-3.43 mean bytes per patch, versus 2.318 for the fixed two-byte probe. The vectorized path was fast, but changing patch positions damaged quality. It is therefore another documented negative control, not a selected architecture.

The strongest later scale candidate uses hardware-aligned width 512, 32-byte local windows, and QK-normalized attention. It reaches 2.0204 BPB at 25.77M parameters, but the matched 26.30M BPE transformer reaches 1.9940 BPB and trains faster. The result is recorded as progress, not a win. Exact portable-domain migration remains independent of this core quality result and continues to pass without PPL or generation changes.

The first pure-PyTorch sparse-state global patch core has also been tested at the 20M boundary. It preserves fixed two-byte ABI positions and cached generation support. The reduced-fan-in variant improved the LayerCake 20M quality frontier from 2.0256 to 2.0214 BPB while remaining smaller than the 20.61M BPE comparator, but it still lost to BPE quality and training time: BPE reached 2.0154 BPB in 113.5 seconds, while sparse-state LayerCake reached 2.0214 BPB in 248.1 seconds. The retained sparse-state certificate is therefore intentionally FAIL:

python scripts/verify_scale20m_sparse_state_frontier.py

The receiving-core comparison has also been rebuilt with the current fused architecture and a retained matched transformer artifact. The selected 6.804M receiver is still smaller than the 6.857M BPE transformer, trains faster, and beats it on general quality. The unchanged transferred domain remains exact and beats the transformer adapter.

| Receiver-frontier gate | LayerCake receiver | Matched transformer | Status |
| --- | --- | --- | --- |
| Parameters | 6.804M | 6.857M | PASS |
| General BPB | 2.1251 | 2.1265 | PASS |
| Training time | 77.05 s | 81.45 s | PASS |
| One-thread CPU generation | 1.47x BPE | ratio > 1 | PASS |
| Transferred-domain BPB | 1.4691 | adapter: 2.1101 | PASS |
| Transfer invariance | PPL ratio 1.0; max logit diff 0; identical generation | n/a | PASS |

python scripts/verify_receiver_frontier.py

The next tier increases the patch core from 0.35M to 5.40M parameters and the ABI from 64 to 96 dimensions. It uses 20 MB of general text and 256-byte contexts.

| Gate | 5.40M result | Status |
| --- | --- | --- |
| Patch vs byte parameters | 5.40M vs 14.57M | PASS |
| Patch vs BPE parameters | 5.40M vs 6.90M | PASS |
| Patch base inference | 243.6K vs 122.1K bytes/s | PASS |
| Patch + brick inference | 232.0K vs 122.1K bytes/s | PASS |
| Source Python PPL | 157.03 -> 40.94 | PASS |
| Source general ratio | 1.0105 | PASS |
| 5.40M -> independent 2.19M transfer | domain ratio 0.533; general 1.021 | PASS |
| Int8 transfer | domain ratio 0.532; general 1.021 | PASS |
| General BPB vs matched BPE | 2.261 vs 2.075 | OPEN / BPE leads |

This larger checkpoint preserves the architecture's size, speed, adaptation, transfer, and quantization advantages. It does not yet reproduce the small-scale BPB parity result. That negative result is part of the public evidence, not hidden.

The 15.45M patch checkpoint has completed 5,000 paired steps:

| Gate | 15.45M checkpoint |
| --- | --- |
| Parameters | 15.45M vs 25.75M byte core |
| General BPB | 2.430 |
| 25.75M byte baseline BPB | 2.227 |
| Patch inference | 227.6K vs 93.0K bytes/s |
| Exact int8 portable-domain PPL | 2.7165 on both 15.45M and 5.40M hosts |
| Filesystem-disjoint stdlib PPL | 5.8296 on both hosts |
| Exact generated-byte identity | PASS |

The patch model is 40.0% smaller and 2.45x faster in this CUDA benchmark. The byte model still leads quality by 0.203 BPB. This is a single-seed, 20 MB local-corpus result.

Verify it with:

python scripts/verify_scale5m_results.py

Verify the selected evidence

pip install -e .[dev]
pytest -q
python scripts/verify_research_gates.py
python scripts/verify_scale5m_results.py
python scripts/verify_scale15m_results.py
python scripts/verify_lossless_domain_results.py
python scripts/verify_mobile_domain_win.py
python scripts/verify_northstar_mobile.py
python scripts/eval_lossless_domain_decoder.py `
  --decoder runs_experiment/portable_python_gru148k_v1.pt `
  --source-core runs_experiment/scale5m_seed4242_continued.pt `
  --target-core runs_experiment/scale2m_seed5151.pt `
  --output results/lossless_domain_scale5m_to_2m.json

Expected:

all tests passed
"status": "PASS"

The verifier reads the committed result artifacts and checks every selected gate. It exits with a non-zero status if a required metric is missing or outside its threshold.
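The gate-checking pattern described above can be sketched in a few lines. The JSON keys, the `("max" | "min", threshold)` gate format, and both function names are hypothetical; the actual schema used by `scripts/verify_research_gates.py` is not shown in this README.

```python
import json


def failed_gates(results: dict, gates: dict) -> list[str]:
    """Return a description of every gate that is missing or out of bounds."""
    failures = []
    for name, (direction, threshold) in gates.items():
        value = results.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif direction == "max" and not value <= threshold:
            failures.append(f"{name}: {value} > {threshold}")
        elif direction == "min" and not value >= threshold:
            failures.append(f"{name}: {value} < {threshold}")
    return failures


def verify(path: str, gates: dict) -> int:
    """Load a result artifact, print a status line, and return an exit code."""
    with open(path, encoding="utf-8") as fh:
        results = json.load(fh)
    failures = failed_gates(results, gates)
    print(json.dumps({"status": "FAIL" if failures else "PASS", "failures": failures}))
    return 1 if failures else 0


# Hypothetical gate names; the numbers are taken from the tables in this README.
gates = {"general_bpb": ("max", 2.0492), "cpu_speed_ratio": ("min", 1.0)}
passing = failed_gates({"general_bpb": 2.0446, "cpu_speed_ratio": 2.91}, gates)
missing = failed_gates({"general_bpb": 2.0446}, gates)
```

A script built this way would end with something like `sys.exit(verify(path, gates))`, so that a missing metric fails the run instead of being silently skipped.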

The original structural paste proof remains available:

python verify_paste.py

What changed about cross-seed generalization

The original LayerCake prototype copied brick weights exactly but failed functionally across independent cores. A chess brick could move bit-for-bit while target PPL exploded because:

  1. each seed learned a different ABI coordinate system; and
  2. each seed decoded ABI deltas through a different output projection.

V2 addresses both causes:

  • Deterministic causal anchors: every core aligns ABI states to the same byte-prefix target basis during training.
  • Canonical brick head: brick deltas use a fixed ABI-to-byte-logit contract shared across interfaces, seeds, and model widths.
  • Correct temporal alignment: the byte state after completed patch n aligns with the context used to predict patch n+1.
  • General-preservation loss: brick training is constrained against the frozen base distribution and must pass an external non-regression gate.
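The temporal-alignment rule in the list above can be made concrete with a small indexing sketch: the context available when predicting patch n+1 ends exactly where patch n is completed, so no byte of the target is ever visible. This is an illustration of the indexing only, using a hypothetical helper, and does not reproduce the model code.

```python
def aligned_pairs(data: bytes, patch_size: int):
    """Yield (context, target): context ends at the close of patch n, target is patch n+1."""
    n_patches = len(data) // patch_size
    for n in range(n_patches - 1):
        context = data[: (n + 1) * patch_size]
        target = data[(n + 1) * patch_size : (n + 2) * patch_size]
        yield context, target


pairs = list(aligned_pairs(b"abcdef", patch_size=2))
```

An off-by-one in either direction would either leak target bytes into the context or waste the most recent completed patch.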

Cross-seed failure without alignment remains an important negative control. For the selected v2 architecture, it is now an explained and addressed failure mode rather than an open question.

Why byte patches

Byte models avoid vocabulary lock-in but make global attention expensive. LayerCake uses:

  • a local causal decoder at byte resolution;
  • a smaller global transformer over patch states;
  • continuous local hidden state across patch boundaries;
  • a canonical ABI above the patch perception layer.

In the selected experiment, global sequence length is reduced 4x. The resulting patch core is smaller and faster than both the byte transformer and the trained BPE baseline while matching BPE general BPB by point estimate.
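To see why a 4x shorter global sequence matters, consider dense self-attention, whose pairwise score count grows with the square of the sequence length. The sketch below uses a 256-byte context and 4-byte patches as illustrative values; it counts dense (unmasked) pairs and ignores the cost of the local byte decoder.

```python
def global_positions(context_bytes: int, patch_size: int) -> int:
    """Number of patch positions the global transformer attends over."""
    return -(-context_bytes // patch_size)  # ceiling division


def dense_attention_pairs(positions: int) -> int:
    """Pairwise attention scores for one dense (unmasked) attention layer."""
    return positions * positions


byte_level = dense_attention_pairs(256)                         # 65,536 pairs
patch_level = dense_attention_pairs(global_positions(256, 4))   # 64 positions -> 4,096 pairs
reduction = byte_level // patch_level                           # 16x fewer pairs
```

A 4x reduction in sequence length therefore cuts the global attention score count by roughly 16x under these assumptions.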

The current implementation uses fixed patches. Learned entropy/difficulty boundaries remain a scale-up target.

Portable sparse domain bricks

The selected brick has:

  • 8 installed low-rank experts;
  • rank 16 per expert;
  • top-2 active experts;
  • 16,897 parameters;
  • a residual no-op initialization;
  • exact state-dict portability;
  • optional int8 fake-quantized transfer.

Installed knowledge does not require evaluating every expert. The router scores installed experts, but only selected expert matrices execute.
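The routing behavior described above (score every installed expert, execute only the selected ones) can be sketched with plain Python lists. This is a generic top-k low-rank mixture, not the code in `layercake/domain_bricks.py`; the softmax gating over the selected experts and all names are assumptions. As a side note, one parameter layout that happens to reproduce the listed count, assuming d_abi = 64, a bias-free router, and one extra scalar, is 8 x (2 x 64 x 16) + 64 x 8 + 1 = 16,897. That decomposition is a guess and is not confirmed by the source.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sparse_brick_delta(state, router, experts, top_k=2):
    """Score all experts, then run only the top_k low-rank experts."""
    scores = matvec(router, state)  # cheap: one score per installed expert
    chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    peak = max(scores[i] for i in chosen)
    gates = [math.exp(scores[i] - peak) for i in chosen]  # softmax over chosen only
    total = sum(gates)
    delta = [0.0] * len(state)
    for gate, i in zip(gates, chosen):
        down, up = experts[i]          # down: rank x d_abi, up: d_abi x rank
        out = matvec(up, matvec(down, state))
        delta = [d + (gate / total) * o for d, o in zip(delta, out)]
    return delta, chosen


# Tiny toy sizes; zero "up" matrices mimic a residual no-op initialization.
d_abi, rank, n_experts = 4, 2, 3
router = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
experts = [
    ([[0.1] * d_abi for _ in range(rank)], [[0.0] * rank for _ in range(d_abi)])
    for _ in range(n_experts)
]
delta, chosen = sparse_brick_delta([0.2, 0.9, 0.5, 0.0], router, experts)

hypothetical_param_count = 8 * (2 * 64 * 16) + 64 * 8 + 1  # guessed layout, see lead-in
```

With zero-initialized up-projections the brick adds nothing to the ABI state, which is one way to obtain the no-op starting point listed above.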

This is different from:

| Method | Key distinction |
| --- | --- |
| LoRA | LoRA matrices are shaped by each target layer and d_model; LayerCake bricks bind to d_abi. |
| Adapter | Ordinary adapters remain model-specific; LayerCake uses a versioned canonical coordinate/output contract. |
| MoE | MoE experts usually belong to one core; LayerCake bricks are portable artifacts. |
| RAG | RAG retrieves external context; bricks modify model behavior in ABI space. |
| Fine-tuning | Full tuning changes the core; brick training freezes it. |
| BLT-style models | BLT targets tokenizer-free dynamic byte compute; LayerCake adds portable ABI-space knowledge. |

Claim ladder

| Level | Meaning | Current evidence |
| --- | --- | --- |
| L0 | Exact weight copy | Proven |
| L1 | Equal ABI input, equal brick function | Proven |
| L2 | Same-core generation identity | Proven on legacy tokenized path |
| L3 | Cross-size structural/function portability | Proven; bounded end-to-end v2 transfer passes locally |
| L4 | Bounded additive cross-seed semantic transfer | Small-scale PASS |
| L5 | Bounded quantized transfer | Small-scale int8 PASS |
| L6 | Bounded tokenizer-independent byte/patch transfer | Small-scale PASS |
| PX | Exact core-independent portable-domain transfer | PASS through 15.45M tier |
| L7 | Orchestrated swarm transfer | Interface implemented; task-level evidence pending |

See RUBRIC.md for exact definitions.

Reproduce or extend the experiments

Core paired training:

python scripts/run_paired_byte_experiment.py `
  --seed 2028 `
  --d-model 128 --layers 3 --heads 4 `
  --patch-size 4 --continuous-local `
  --patch-d-model 96 --patch-layers 2 --patch-heads 4 `
  --d-byte 32 --d-abi 64 `
  --steps 4000 --brick-steps 1000 `
  --artifact runs_experiment/my_core.pt `
  --output results/my_core.json

Train a sparse portable brick:

python scripts/train_sparse_brick_artifact.py `
  --core runs_experiment/my_core.pt `
  --steps 6000 --rank 16 --experts 8 --top-k 2 `
  --preserve-weight 2 `
  --artifact runs_experiment/my_brick.pt `
  --output results/my_brick.json

Transfer and quantize it:

python scripts/eval_portable_brick.py `
  --brick runs_experiment/my_brick.pt `
  --target runs_experiment/another_core.pt `
  --quantize-int8 `
  --output results/my_transfer.json

Benchmark inference:

python scripts/benchmark_canonical_artifact.py `
  --core runs_experiment/my_core.pt `
  --brick runs_experiment/my_brick.pt `
  --iterations 300 --rounds 9 `
  --output results/my_inference.json

Repository map

layercake/
  abi.py                  versioned compatibility contract
  abi_alignment.py        anchor, whitening, and alignment losses
  canonical_anchors.py    deterministic seed-independent prefix basis
  causal_byte_models.py   strictly causal byte and byte-patch models
  byte_patch.py           codecs, patchers, and metadata
  domain_bricks.py        low-rank and top-k sparse portable operators
  portable_domain.py      exact core-independent domain prediction payload
  rolling/                rollbackable model-commit training substrate
  orchestration.py        CorticalSwarm-style handoff packet and router
  transfer.py             copy, PPL, and degradation contracts

scripts/
  run_paired_byte_experiment.py
  train_sparse_brick_artifact.py
  eval_portable_brick.py
  eval_lossless_domain_decoder.py
  benchmark_bpe_baseline.py
  benchmark_canonical_artifact.py
  demo_rolling_training.py
  benchmark_rolling_training.py
  benchmark_rollback_cost.py
  benchmark_cherrypick_transfer.py
  verify_research_gates.py

results/
  research_gate_certificate.json
  certificates/rolling_demo_certificate.json
  selected raw benchmark and transfer artifacts

The original flat model.py, training scripts, and paste proof remain intact for legacy reproduction.

What is not yet established

  • Results have not yet been replicated at 25M, 60M, 150M, or 1B scale.
  • The current parity result uses one selected byte-patch seed and one BPE seed.
  • Confidence intervals, energy-to-quality measurements, and matched wall-clock scaling curves have not yet been produced.
  • Dynamic learned patching is not implemented.
  • Native int8 kernels were not benchmarked; current int8 evidence uses quantize/dequantize.
  • L7 orchestration has serialization and routing tests but no end-to-end task benchmark.
  • Production serving, security hardening, and distributed training are not complete.

The next milestone is a 25M-class, three-seed, matched-byte experiment with frozen hashes and the same transfer matrix.

Documentation

Citation

@software{layercake2026,
  author  = {Yoder, Sam},
  title   = {LayerCake: Tokenizer-Free Byte-Patch Models with a Canonical Knowledge ABI},
  year    = {2026},
  url     = {https://github.com/Yoder23/layercake},
  license = {Apache-2.0}
}

License

Apache 2.0. See LICENSE.

About

Modular language models via a fixed-dimension ABI bottleneck — domain modules are bit-exactly portable across model sizes
