A Rust SDK for private, edge-native LLM applications.
Edge Intelligence is an offline-first runtime for small language models on phones, embedded devices, and portable WASM hosts. It is built around one constraint: the useful parts of an LLM app should keep working when the network disappears, and sensitive user data should never need to leave the device.
The SDK targets approximately 0.5B parameter models, with a pure Rust core, static memory planning, signed model loading, grammar-constrained decoding, privacy-preserving telemetry, and host bindings for mobile and web runtimes. Local inference is the default path. Frontier and OpenAI-compatible providers exist only behind an explicit opt-in backend.
New here? Jump to Architecture and entry points to see where the SDK starts and how the crates fit together, then Quick start to build it.
- Why Edge Intelligence?
- Architecture and entry points
- Quick start
- Local chat test client
- Workspace map
- Benchmarks
- Architecture decisions
- Domain model
- Roadmap
- Documentation
Most mobile LLM stacks start in the cloud and add local features later. This project starts at the edge.
The six principles that shape every crate
| Principle | What it means in the SDK |
|---|---|
| Local first | The runtime, memory planner, safety checks, and telemetry core have no network dependency. |
| Rust all the way down | Project-owned SDK code avoids C/C++ and keeps unsafe code out of the core crates. |
| Predictable memory | The decode loop uses pre-planned arenas and descriptor-based KV-cache management. |
| Structured output | Grammar masks run before sampling so tool calls and JSON outputs stay valid. |
| Provable provenance | Model signatures are verified before a session can be constructed. |
| Opt-in egress | Cloud/frontier providers are a separate adapter and must be wired deliberately by the host app. |
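The "Provable provenance" row can be illustrated with a small type-level sketch: a session constructor that only accepts a permit, and a permit that only a successful verification can produce. All names here (`gate`, `SignatureVerifier`, `LoadPermit`, `Session`, `ToyVerifier`) are hypothetical stand-ins rather than the el-provenance API, and the toy verifier is a checksum, not cryptography.

```rust
pub mod gate {
    /// Hypothetical verifier port. A real adapter would check an Ed25519
    /// signature here; this trait is an assumption made for the sketch.
    pub trait SignatureVerifier {
        fn verify(&self, model: &[u8], signature: &[u8]) -> bool;
    }

    /// Proof that a model passed verification. The field is private, so code
    /// outside this module cannot build a permit by hand.
    pub struct LoadPermit {
        model_len: usize,
    }

    #[derive(Debug, PartialEq)]
    pub struct SignatureRejected;

    /// The only way to obtain a permit: ask a verifier and pass its check.
    pub fn request_permit<V: SignatureVerifier>(
        verifier: &V,
        model: &[u8],
        signature: &[u8],
    ) -> Result<LoadPermit, SignatureRejected> {
        if verifier.verify(model, signature) {
            Ok(LoadPermit { model_len: model.len() })
        } else {
            Err(SignatureRejected)
        }
    }

    pub struct Session {
        model_len: usize,
    }

    impl Session {
        /// Consuming a permit is the only constructor, so "no permit"
        /// means "no session" at compile time.
        pub fn new(permit: LoadPermit) -> Self {
            Session { model_len: permit.model_len }
        }

        pub fn model_len(&self) -> usize {
            self.model_len
        }
    }

    /// Toy verifier for the example only: a one-byte wrapping checksum.
    pub struct ToyVerifier;

    impl SignatureVerifier for ToyVerifier {
        fn verify(&self, model: &[u8], signature: &[u8]) -> bool {
            let sum = model.iter().fold(0u8, |acc, b| acc.wrapping_add(*b));
            signature.len() == 1 && signature[0] == sum
        }
    }
}
```

Calling `gate::request_permit(&gate::ToyVerifier, &[1, 2, 3], &[6])` yields a permit that `gate::Session::new` accepts; any other signature returns `Err(SignatureRejected)`, leaving no way to reach the constructor.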
Edge Intelligence is a hexagonal (ports-and-adapters) workspace, not a single
monolithic crate. There is no edge-intelligence umbrella crate; coherence
comes from three layers, each with a clear entry point:
| Layer | Crate · symbol | Where it fits |
|---|---|---|
| Device SDK facade | el-ffi · EdgeLlm | The composition root shipped to devices. Wires the local engine and opt-in cloud behind one flat API and projects it to React Native (UniFFI/JSI), Flutter (FRB), and Web (wasm-bindgen). Start here to build an app. |
| Rust API seam | el-core · LlmProvider | The single trait every backend implements and every Rust consumer calls (ADR-010). Start here to embed the SDK in Rust. |
| Orchestrator | el-runtime · InferenceSession | Composes provenance, memory, safety, and grammar into the decode loop: the engine the providers drive. |
Edge device app · Kotlin · Swift · Dart · TypeScript
        │
        ▼
el-ffi · EdgeLlm                  ← device SDK entry point (composition root)
        │
        ▼
el-core · LlmProvider (trait)     ← unified Rust API seam (ADR-010)
        │
   ┌────┴───────────────┐
   ▼                    ▼
el-engine-candle     el-cloud     ← backends: local (default) · opt-in frontier
   │
   ▼
el-runtime · InferenceSession     ← orchestrator
   │  composes
   ▼
el-memory · el-provenance · el-safety · el-grammar
Which entry point should I use?
- Building a mobile or web app → use el-ffi's EdgeLlm. Construct EdgeLlm::local(model_uri) (air-gapped) or EdgeLlm::cloud(model, api_key) (opt-in), then call ask(...) / ask_stream(...). The crate compiles to a native library and a wasm package and ships generated TypeScript/Dart bindings.
- Embedding the SDK in Rust → construct a concrete provider and talk to it through el_core::LlmProvider: el_engine_candle::QwenChatProvider for on-device chat, or el_cloud::CloudProvider for a frontier backend. apps/el-chat is a worked example.
- Extending the SDK (new engine, grammar, safety, or compression) → implement the matching port trait from el-runtime (InferenceEngine, GrammarMasker, SafetySteerer, PromptCompressor), or implement LlmProvider directly for a whole new backend.
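The idea of one host-facing trait with a default local backend and a deliberately wired cloud backend can be sketched as follows. This is a minimal illustration under stated assumptions: `ChatBackend`, `LocalBackend`, `CloudBackend`, and `build_backend` are hypothetical names, the method signatures are not those of el_core::LlmProvider, and neither backend performs inference or network I/O.

```rust
/// Hypothetical stand-in for a provider trait; not the el-core definition.
pub trait ChatBackend {
    fn name(&self) -> &'static str;
    fn ask(&mut self, prompt: &str) -> Result<String, String>;
}

/// Placeholder local backend: echoes the prompt instead of running a model.
pub struct LocalBackend {
    turns: usize,
}

impl LocalBackend {
    pub fn new() -> Self {
        Self { turns: 0 }
    }
}

impl ChatBackend for LocalBackend {
    fn name(&self) -> &'static str {
        "local"
    }

    fn ask(&mut self, prompt: &str) -> Result<String, String> {
        self.turns += 1;
        Ok(format!("[local turn {}] {}", self.turns, prompt))
    }
}

/// Placeholder cloud backend: it can only exist if the host supplies a key,
/// which is how this sketch models "egress is opt-in at construction".
pub struct CloudBackend {
    api_key: String,
}

impl CloudBackend {
    pub fn new(api_key: &str) -> Result<Self, String> {
        if api_key.is_empty() {
            Err("cloud backend requires an API key".to_string())
        } else {
            Ok(Self { api_key: api_key.to_string() })
        }
    }
}

impl ChatBackend for CloudBackend {
    fn name(&self) -> &'static str {
        "cloud"
    }

    fn ask(&mut self, prompt: &str) -> Result<String, String> {
        // No request is sent; the reply only shows which path was taken.
        Ok(format!("[cloud, key of {} chars] {}", self.api_key.len(), prompt))
    }
}

/// Composition root: local unless the host explicitly passes a cloud key.
pub fn build_backend(cloud_api_key: Option<&str>) -> Result<Box<dyn ChatBackend>, String> {
    match cloud_api_key {
        None => Ok(Box::new(LocalBackend::new())),
        Some(key) => Ok(Box::new(CloudBackend::new(key)?)),
    }
}
```

Host code holds a `Box<dyn ChatBackend>` and calls `ask` without knowing which backend answered; passing `None` to `build_backend` never touches the cloud type at all.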
The per-token decode pipeline and SDK seams
Host app
  |
  v
LlmProvider trait
  |------------------------------|
  v                              v
Local Candle runtime           Opt-in cloud adapter
  |
  v
load gate -> memory plan -> prefill -> decode loop
                                           |
                                           v
grammar mask -> safety adjust -> sample -> commit KV
                                           |
                                           v
content-free events and metrics
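The "memory plan" stage in the pipeline above can be pictured as assigning every tensor a fixed offset in one arena before decoding starts, so that tensors whose lifetimes never overlap share the same bytes. The sketch below uses a generic greedy first-fit pass; it is an assumption for illustration, not the el-memory algorithm, and `TensorLife`, `Placement`, and `plan_arena` are hypothetical names.

```rust
/// A tensor's size in bytes and the inclusive range of steps that use it.
#[derive(Debug, Clone, Copy)]
pub struct TensorLife {
    pub size: usize,
    pub first_step: usize,
    pub last_step: usize,
}

/// Where a tensor lives inside the arena.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Placement {
    pub offset: usize,
    pub size: usize,
}

/// Two tensors conflict only if some step needs both of them.
fn lifetimes_overlap(a: &TensorLife, b: &TensorLife) -> bool {
    a.first_step <= b.last_step && b.first_step <= a.last_step
}

/// Greedy first-fit planner. Returns one placement per input tensor (in input
/// order) plus the total arena size needed.
pub fn plan_arena(tensors: &[TensorLife]) -> (Vec<Placement>, usize) {
    let mut placements: Vec<Placement> = Vec::with_capacity(tensors.len());
    let mut arena_size = 0;

    for (i, tensor) in tensors.iter().enumerate() {
        // Byte ranges already claimed by tensors alive at the same time.
        let mut busy: Vec<(usize, usize)> = (0..i)
            .filter(|&j| lifetimes_overlap(tensor, &tensors[j]))
            .map(|j| (placements[j].offset, placements[j].offset + placements[j].size))
            .collect();
        busy.sort();

        // Slide past busy ranges until a gap large enough appears.
        let mut offset = 0;
        for (start, end) in busy {
            if offset + tensor.size <= start {
                break;
            }
            offset = offset.max(end);
        }

        placements.push(Placement { offset, size: tensor.size });
        arena_size = arena_size.max(offset + tensor.size);
    }

    (placements, arena_size)
}
```

With three tensors of 100, 50, and 100 bytes where the first and third are never live together, the third reuses offset 0 and the arena needs 150 bytes instead of 250. Because the plan is computed once up front, the decode loop itself does not need to allocate.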
The workspace proves the main seams of the SDK:
- Runtime orchestration: el-runtime owns the session state machine and enforces the decode order: grammar mask, safety adjustment, sampling, commit.
- Static memory: el-memory plans tensor lifetimes into a reusable arena and models descriptor-only KV-cache compaction.
- Model provenance: el-provenance and el-provenance-ed25519 implement the hard load gate: no verified signature, no load permit, no session.
- Grammar constraints: el-grammar compiles regular grammars into a token-level DFA masker; the el-grammar-llguidance adapter provides real JSON-schema masking over llguidance with a HuggingFace tokenizer bridge.
- Safety: el-safety provides the tiered policy model and lightweight blacklist steering path, with SecDecoding-style model-backed safety tracked as follow-up work.
- Inference engine seam: el-engine-candle runs a real Candle CPU forward pass (a single-projection seam proof plus a real Qwen2 transformer) and drives the runtime loop end to end.
- Provider seam: el-core::LlmProvider gives local and frontier backends one host-facing API; el-cloud implements the opt-in OpenAI-compatible path.
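The decode order named in the list above (grammar mask, safety adjustment, sampling, commit) can be sketched as a single function over raw logits. This is an illustrative assumption, not el-runtime code: `decode_step` is a hypothetical name, the grammar mask is a plain boolean slice, the safety adjustment is a list of per-token penalties, and "commit" is modeled as appending the chosen token to a vector rather than updating a KV cache.

```rust
/// One decode step in the fixed order: mask -> adjust -> sample -> commit.
/// Returns the chosen token, or `None` if no token is eligible.
pub fn decode_step(
    logits: &mut [f32],
    allowed: &[bool],           // grammar mask: true = token permitted
    penalties: &[(usize, f32)], // safety: (token id, amount to subtract)
    committed: &mut Vec<usize>, // stand-in for the committed sequence state
) -> Option<usize> {
    assert_eq!(logits.len(), allowed.len(), "mask must cover the vocabulary");

    // 1. Grammar mask: forbidden tokens can never win, whatever comes next.
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }

    // 2. Safety adjustment: push down discouraged tokens that are still legal.
    for &(token, penalty) in penalties {
        if let Some(logit) = logits.get_mut(token) {
            *logit -= penalty;
        }
    }

    // 3. Sampling: deterministic greedy argmax; the lowest id wins ties.
    //    Masked (-inf) and other non-finite logits are skipped.
    let mut best: Option<(usize, f32)> = None;
    for (id, &logit) in logits.iter().enumerate() {
        if !logit.is_finite() {
            continue;
        }
        if best.map_or(true, |(_, current)| logit > current) {
            best = Some((id, logit));
        }
    }

    // 4. Commit: only a token that survived every earlier stage is recorded.
    let token = best?.0;
    committed.push(token);
    Some(token)
}
```

Running the mask first is what keeps structured output valid: a penalty or a sampler can reorder the legal tokens but can never resurrect an illegal one. For logits `[1.0, 5.0, 3.0, 4.0]` with token 1 masked and token 3 penalized by 2.5, the step commits token 2.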
Prerequisite: Rust 1.96 or newer, matching the workspace rust-version.
cargo build --workspace
cargo test --workspace
Build just the dependency-light core, or cross-compile to WASM
Build and test only the pure-Rust local core (no Candle, no network):
cargo test -p el-core -p el-memory -p el-telemetry -p el-provenance -p el-safety -p el-runtime -p el-grammar
Cross-compile that core for WASM:
rustup target add wasm32-wasip1 wasm32-unknown-unknown
cargo build --target wasm32-wasip1 -p el-core -p el-memory -p el-telemetry -p el-provenance -p el-safety -p el-runtime -p el-grammar
Cross-compile the device bindings (el-ffi) for Android / iOS / Web via the Makefile: make build-android, make build-ios, make build-wasm, or make bindings for all three codegen surfaces.
apps/el-chat is an interactive REPL that holds a real
multi-turn conversation with a small LLM running entirely on-device. It
exists to exercise the SDK end-to-end, so its only direct dependencies are SDK
crates (el-core, el-engine-candle) — it contains no inference, model, or
tokenizer code of its own.
Fetch a model and run it
Every reply flows through the ADR-010 LlmProvider seam:
el-chat → el_core::LlmProvider → el_engine_candle::QwenChatProvider
(real Qwen2 forward via candle-transformers)
→ el_runtime::InferenceSession
(provenance gate → prefill → decode loop)
Decoding is the runtime's deterministic greedy argmax, so replies are reproducible. The model is supplied as a local file — there is no runtime network egress (ADR-004 air-gap by default). Fetch a small instruct model once:
mkdir -p models
curl -sSL -o models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
curl -sSL -o models/qwen2.5-0.5b-instruct.tokenizer.json \
https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/tokenizer.json
Then chat (the defaults point at the files above):
cargo run -p el-chat # interactive REPL
cargo run -p el-chat -- --prompt "Hello!" --once # one-shot
cargo run -p el-chat -- --system "Be terse." --max-tokens 128
REPL commands: /reset, /system <text>, /help, /exit. Other flags: --model, --tokenizer, --system, --max-tokens. The models/ directory is git-ignored. See apps/el-chat/README.md for the full user guide.
Twelve crates plus two apps. Each crate has its own README (linked below) covering its public API, a usage example, and the ADRs it realizes; app rows identify the runnable clients.
| Crate | Role | Current state |
|---|---|---|
| crates/el-core | Shared types, IDs, errors, events, LlmProvider trait | Implemented and tested |
| crates/el-memory | Static arena planning and KV-cache descriptors | Implemented and tested |
| crates/el-telemetry | Content-free event handling and privacy metrics | Implemented and tested |
| crates/el-provenance | Verified model load permits | Implemented and tested |
| crates/el-safety | Tiered decoder-time safety policy | Partial, lightweight path implemented |
| crates/el-runtime | Session lifecycle and decode-loop orchestration | Implemented and tested |
| crates/el-grammar | DFA grammar masking | Implemented and tested |
| crates/adapters/el-provenance-ed25519 | Real ED25519 signature verification | Implemented and tested |
| crates/adapters/el-engine-candle | Candle inference adapter: engine-seam proof plus a real Qwen2 transformer engine and chat provider | Implemented; real on-device chat |
| crates/adapters/el-cloud | Opt-in OpenAI-compatible provider backend | Implemented; egress opt-in at construction |
| crates/adapters/el-ffi | Device SDK facade (EdgeLlm): Flutter / UniFFI / wasm-bindgen binding surfaces | Implemented and tested; host build is a workspace member, cross-target builds via make |
| crates/adapters/el-grammar-llguidance | llguidance JSON-schema token masking | Implemented and tested; workspace-excluded (crates.io deps) |
| apps/el-chat | Interactive chat test client; SDK-only deps, drives the runtime end-to-end | Implemented; runs real on-device chat |
| apps/el-bench | Benchmark harness; SDK-only deps, replays quality/safety task sets through the runtime | Implemented; model-agnostic, reproducible |
Of the adapters, only el-grammar-llguidance is excluded from the default
workspace build (it pulls crates.io-only grammar dependencies); el-cloud and
el-ffi are regular members whose host targets build and test with
cargo test --workspace.
The SDK ships two reproducible benchmark harnesses. Both run inference through the
public LlmProvider seam, so they characterize the SDK's own behavior and are
model-agnostic: point them at whichever signed model your product loads. The dated
reports under docs/benchmarks/ publish reference results on the prototype's
reference model (Qwen2.5-0.5B-Instruct q4_k_m, Intel i5-14500, release build) so
the trend is visible. Re-run either harness against your own model to reproduce.
The first harness gives a phase-level latency breakdown of the local decode path, behind
opt-in, zero-cost-when-unset instrumentation (EL_BENCH=1). The SDK-side conclusion is
stable across every run: the runtime's own per-token glue (decode loop, grammar
mask, full-vocab argmax, KV commit, content-free event emission) is under ~1.2%
of decode time; ~99% is the raw model forward pass. The wins have therefore come from
removing orchestration costs, tracked across three reports:
| Milestone | Report | Per-turn weight reload | Prefill (full suite) | Multi-turn re-prefill | Full-suite wall¹ |
|---|---|---|---|---|---|
| Baseline (pre-ADR-018) | 2026-06-14 | ~1.2 s every turn | un-batched, ~15 tok/s, the #1–2 cost | whole conversation re-prefilled each turn | — (micro-bench) |
| Resident model + stateful sessions (ADR-018) | 2026-06-22 | 0 ms (loaded once) | 56.8% of compute (now the #1 cost) | still re-prefilled | 53.7 min |
| + Cross-turn prefix reuse (ADR-018 AC-3) | 2026-06-23 | 0 ms | 29.5% of compute (−65% wall) | only the new suffix is prefilled | 35.7 min (−33.5%) |
¹ Same el-bench suite (66 tasks → 97 replies, 256 max-tokens, safety off), same
host — the 06-22 and 06-23 rows are directly comparable. The 06-14 baseline is an
el-chat micro-benchmark that isolated the bottlenecks, so its wall time is not
run-for-run comparable; the per-phase rates and bottleneck status are.
Net effect so far: the per-turn 491 MB weight-reload tax is gone, and multi-turn
prefill no longer re-processes the whole conversation — cutting the full-suite run by
a third and making multi-turn (mindeval) ~3× faster (52.2 → 17.5 s/reply). The
remaining levers are kernel-level: batching the prefill of a cold/first turn
(ADR-020 — the per-forward prefill
rate is still ~15 tok/s) and the ~14 tok/s quantized decode floor, which is now
the dominant cost.
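Cross-turn prefix reuse, as described above, comes down to comparing the tokens a session has already processed with the tokens of the new conversation and prefilling only what lies past their shared prefix. The sketch below shows that bookkeeping only; `CachedSession` and its methods are hypothetical names, and a plain token vector stands in for the real KV cache.

```rust
/// Tracks which tokens a session has already prefilled.
pub struct CachedSession {
    tokens: Vec<u32>,
}

impl CachedSession {
    pub fn new() -> Self {
        Self { tokens: Vec::new() }
    }

    /// Brings the cache in line with `conversation` and returns how many
    /// tokens actually had to be prefilled.
    pub fn prefill(&mut self, conversation: &[u32]) -> usize {
        // Length of the longest prefix shared by the cache and the new input.
        let shared = self
            .tokens
            .iter()
            .zip(conversation)
            .take_while(|(cached, incoming)| cached == incoming)
            .count();

        // Anything cached past the shared prefix belongs to a diverged
        // history and must be dropped before new tokens are appended.
        self.tokens.truncate(shared);

        // Only the new suffix costs prefill work.
        let suffix = &conversation[shared..];
        self.tokens.extend_from_slice(suffix);
        suffix.len()
    }

    pub fn cached_len(&self) -> usize {
        self.tokens.len()
    }
}
```

A first turn of three tokens prefills all three; a follow-up that extends the same conversation to five tokens prefills only the two new ones. If the history is edited so that it diverges early, the cache is truncated at the divergence point and everything after it is prefilled again.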
apps/el-bench · benchmarks/README.md ·
2026-06-15 report
el-bench replays published mental-health benchmarks — CounselBench, MindEval, and
the VERA-MH suicide-risk safety suite (66 tasks → 97 replies) — through the runtime
and records transcripts for scoring against each benchmark's rubric. Datasets and
transcripts are fetched or produced locally and are git-ignored (third-party data);
only the harness and the methodology are committed. Decoding is deterministic, so a
given model + task set yields identical transcripts — it is designed to run as a
CI safety gate whenever the model, system prompt, or ADR-005 safety tier changes.
Headline (reference model): used as an unsupervised responder, the raw 0.5B model is categorically unsafe for crisis scenarios — 0/15 risk personas got a specific crisis resource and 0/17 got a direct safety question. This is the hard requirement behind ADR-005 decoder-time safety and a dedicated crisis-routing layer; see the report for the full rubric-by-rubric findings.
The project is intentionally decision-heavy because mobile LLM runtimes are easy to overfit to one device, model, or provider. The core choices are recorded as ADRs.
The ten architecture decision records
| ADR | Decision |
|---|---|
| ADR-001 | Native ARM plus WASM as first-class runtime targets |
| ADR-002 | Candle as the Rust-native inference engine |
| ADR-003 | Static memory planning with a zero-allocation arena |
| ADR-004 | Air-gapped by default |
| ADR-005 | On-device decoder-time safety |
| ADR-006 | Mandatory ED25519 model-signature verification |
| ADR-007 | Content-free domain events for privacy-preserving telemetry |
| ADR-008 | Rust instead of C/C++ |
| ADR-009 | Flutter Rust Bridge for Dart bindings |
| ADR-010 | Unified local/cloud LlmProvider trait |
See the full index in docs/adr/README.md.
The DDD model lives in docs/ddd. The key invariant
across its contexts: air-gap is the default runtime shape, not a feature flag
sprinkled through the code — any outbound behavior must be modeled as an
explicit port or adapter.
The nine bounded contexts
- Inference Runtime
- Prompt Compression
- Speculative Decoding
- Grammar Constraint
- Safety
- Memory Management
- Hardware and Delegate
- Model Provenance and Security
- Telemetry and Privacy
The prototype has proven the architectural seams. The next engineering work is to replace toy proofs with production-grade runtime pieces.
What's next
- Production GGUF/safetensors loading and transformer execution in el-engine-candle.
- Lark/CFG grammars in the llguidance adapter (JSON-schema masking is done).
- SecDecoding/CSD-style model-backed safety with runtime backtracking.
- Binding codegen and packaging: FRB Dart codegen, uniffi-bindgen-react-native, wasm-pack npm publishing (the Rust binding surfaces exist in el-ffi).
- Mobile toolchain validation for Android and iOS aarch64 targets.
- On-device benchmarks for memory high-water marks and thermal behavior, and wiring the el-bench VERA-MH safety suite into CI as a release gate (latency and quality/safety harnesses already exist; see Benchmarks).
- Product and technical rationale: docs/prd.md
- Domain model: docs/ddd/README.md
- Architecture decisions: docs/adr/README.md
- Per-crate guides: see the Workspace map; every crate links to its own README.
Edge Intelligence is still early, but the direction is deliberate: a small, auditable, Rust-native SDK that lets app developers choose local inference first and add remote intelligence only when the product explicitly calls for it.
