Edge Intelligence

A Rust SDK for private, edge-native LLM applications.

Edge Intelligence is an offline-first runtime for small language models on phones, embedded devices, and portable WASM hosts. It is built around one constraint: the useful parts of an LLM app should keep working when the network disappears, and sensitive user data should not need to leave the device.

The SDK targets models of roughly 0.5B parameters. It combines a pure Rust core, static memory planning, signed model loading, grammar-constrained decoding, privacy-preserving telemetry, and host bindings for mobile and web runtimes. Local inference is the default path; frontier and OpenAI-compatible providers are available only through an explicit opt-in backend.

New here? Jump to Architecture and entry points to see where the SDK starts and how the crates fit together, then Quick start to build it.

Why Edge Intelligence?

Most mobile LLM stacks start in the cloud and add local features later. This project starts at the edge.

The six principles that shape every crate

Principle	What it means in the SDK
Local first	The runtime, memory planner, safety checks, and telemetry core have no network dependency.
Rust all the way down	Project-owned SDK code avoids C/C++ and keeps unsafe code out of the core crates.
Predictable memory	The decode loop uses pre-planned arenas and descriptor-based KV-cache management.
Structured output	Grammar masks run before sampling so tool calls and JSON outputs stay valid.
Provable provenance	Model signatures are verified before a session can be constructed.
Opt-in egress	Cloud/frontier providers are a separate adapter and must be wired deliberately by the host app.

Architecture and entry points

Edge Intelligence is a hexagonal (ports-and-adapters) workspace, not a single monolithic crate. There is no edge-intelligence umbrella crate; coherence comes from three layers, each with a clear entry point:

Layer	Crate · symbol	Where it fits
Device SDK facade	`el-ffi` · `EdgeLlm`	The composition root shipped to devices. Wires the local engine and opt-in cloud behind one flat API and projects it to React Native (UniFFI/JSI), Flutter (FRB), and Web (wasm-bindgen). Start here to build an app.
Rust API seam	`el-core` · `LlmProvider`	The single trait every backend implements and every Rust consumer calls (ADR-010). Start here to embed the SDK in Rust.
Orchestrator	`el-runtime` · `InferenceSession`	Composes provenance, memory, safety, and grammar into the decode loop — the engine the providers drive.

  Edge device app  ·  Kotlin · Swift · Dart · TypeScript
          │
          ▼
  el-ffi · EdgeLlm                      ← device SDK entry point (composition root)
          │
          ▼
  el-core · LlmProvider (trait)         ← unified Rust API seam (ADR-010)
          │
     ┌────┴───────────────┐
     ▼                     ▼
  el-engine-candle      el-cloud        ← backends: local (default) · opt-in frontier
     │
     ▼
  el-runtime · InferenceSession         ← orchestrator
     │   composes
     ▼
  el-memory · el-provenance · el-safety · el-grammar

Which entry point should I use?

Building a mobile or web app → use el-ffi's EdgeLlm. Construct EdgeLlm::local(model_uri) (air-gapped) or EdgeLlm::cloud(model, api_key) (opt-in), then call ask(...) / ask_stream(...). The crate compiles to a native library and a wasm package and ships generated TypeScript/Dart bindings.
Embedding the SDK in Rust → construct a concrete provider and talk to it through el_core::LlmProvider: el_engine_candle::QwenChatProvider for on-device chat, or el_cloud::CloudProvider for a frontier backend. apps/el-chat is a worked example.
Extending the SDK (new engine, grammar, safety, or compression) → implement the matching port trait from el-runtime (InferenceEngine, GrammarMasker, SafetySteerer, PromptCompressor), or implement LlmProvider directly for a whole new backend.
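
The "one trait, many backends" idea behind these entry points can be sketched in a few lines. This is not the el-core definition: the method names `name` and `complete`, the `ProviderError` type, the free function `ask`, and both toy backends are assumptions invented for illustration, so check el-core for the actual trait.

```rust
/// Hypothetical error type; the real el-core error enum is not shown in this README.
#[derive(Debug, PartialEq)]
pub enum ProviderError {
    EmptyPrompt,
}

/// Illustrative stand-in for `el_core::LlmProvider` (signatures are assumed).
pub trait LlmProvider {
    fn name(&self) -> &str;
    fn complete(&mut self, prompt: &str) -> Result<String, ProviderError>;
}

/// Toy local backend: echoes the prompt and never touches the network.
pub struct EchoLocal;

impl LlmProvider for EchoLocal {
    fn name(&self) -> &str {
        "local"
    }

    fn complete(&mut self, prompt: &str) -> Result<String, ProviderError> {
        if prompt.is_empty() {
            return Err(ProviderError::EmptyPrompt);
        }
        Ok(format!("echo: {prompt}"))
    }
}

/// Toy remote backend: it only exists if the host constructs it on purpose.
/// No request is sent here; the reply is a canned string.
pub struct StubCloud {
    model: String,
}

impl StubCloud {
    pub fn new(model: &str) -> Self {
        Self { model: model.to_string() }
    }
}

impl LlmProvider for StubCloud {
    fn name(&self) -> &str {
        "cloud"
    }

    fn complete(&mut self, prompt: &str) -> Result<String, ProviderError> {
        if prompt.is_empty() {
            return Err(ProviderError::EmptyPrompt);
        }
        Ok(format!("[{}] {prompt}", self.model))
    }
}

/// Host code depends only on the trait, so backends are interchangeable.
pub fn ask(provider: &mut dyn LlmProvider, prompt: &str) -> Result<String, ProviderError> {
    provider.complete(prompt)
}
```

Calling `ask(&mut EchoLocal, "hi")` and `ask(&mut StubCloud::new("demo-model"), "hi")` goes through the same call site; only the value the host chose to construct differs.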

The per-token decode pipeline and SDK seams

Host app
  |
  v
LlmProvider trait
  |------------------------------|
  v                              v
Local Candle runtime         Opt-in cloud adapter
  |
  v
load gate -> memory plan -> prefill -> decode loop
                                      |
                                      v
                         grammar mask -> safety adjust -> sample -> commit KV
                                      |
                                      v
                         content-free events and metrics
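
The per-token order in the diagram (grammar mask, safety adjust, sample, commit) can be sketched as follows. All function names are hypothetical and are not the el-runtime API; the additive penalty is an assumed stand-in for the safety adjustment, and a plain `Vec<usize>` of token ids stands in for the KV-cache commit.

```rust
/// Step 1: tokens the grammar forbids get a logit of negative infinity.
pub fn apply_grammar_mask(logits: &mut [f32], allowed: &[bool]) {
    debug_assert_eq!(logits.len(), allowed.len());
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Step 2: a lightweight blacklist steer, modeled here as a fixed logit penalty.
pub fn apply_safety_penalty(logits: &mut [f32], blocked: &[usize], penalty: f32) {
    for &id in blocked {
        if let Some(logit) = logits.get_mut(id) {
            *logit -= penalty;
        }
    }
}

/// Step 3: deterministic greedy argmax. Non-finite logits (masked tokens, NaN)
/// are skipped, and ties resolve to the lowest token id.
pub fn greedy_argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (id, &logit) in logits.iter().enumerate() {
        if !logit.is_finite() {
            continue;
        }
        match best {
            Some((_, top)) if logit <= top => {}
            _ => best = Some((id, logit)),
        }
    }
    best.map(|(id, _)| id)
}

/// One decode step in the documented order. Step 4 (commit) only happens
/// when a token was actually selected.
pub fn decode_step(
    logits: &mut [f32],
    allowed: &[bool],
    blocked: &[usize],
    penalty: f32,
    committed: &mut Vec<usize>,
) -> Option<usize> {
    apply_grammar_mask(logits, allowed);
    apply_safety_penalty(logits, blocked, penalty);
    let token = greedy_argmax(logits)?;
    committed.push(token);
    Some(token)
}
```

Because the mask runs before sampling, a token the grammar forbids can never be selected, whatever its raw logit was. If every token is masked, the step returns `None` and commits nothing.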

The workspace proves the main seams of the SDK:

Runtime orchestration: el-runtime owns the session state machine and enforces the decode order: grammar mask, safety adjustment, sampling, commit.
Static memory: el-memory plans tensor lifetimes into a reusable arena and models descriptor-only KV-cache compaction.
Model provenance: el-provenance and el-provenance-ed25519 implement the hard load gate: no verified signature, no load permit, no session.
Grammar constraints: el-grammar compiles regular grammars into a token-level DFA masker; the el-grammar-llguidance adapter provides real JSON-schema masking over llguidance with a HuggingFace tokenizer bridge.
Safety: el-safety provides the tiered policy model and lightweight blacklist steering path, with SecDecoding-style model-backed safety tracked as follow-up work.
Inference engine seam: el-engine-candle runs a real Candle CPU forward pass (a single-projection seam proof plus a real Qwen2 transformer) and drives the runtime loop end to end.
Provider seam: el-core::LlmProvider gives local and frontier backends one host-facing API; el-cloud implements the opt-in OpenAI-compatible path.
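
The hard load gate listed above ("no verified signature, no load permit, no session") lends itself to a type-level sketch: a session constructor that only accepts a permit, and a permit that only the verifier can mint. Every name below is hypothetical, and `ToyChecksum` is deliberately not cryptography; it stands in for ED25519 so the example needs no external crates.

```rust
pub mod gate {
    /// Port for signature checking (assumed shape, not the el-provenance trait).
    pub trait SignatureVerifier {
        fn is_valid(&self, model: &[u8], signature: &[u8]) -> bool;
    }

    /// NOT a real signature scheme: a one-byte wrapping sum used as a placeholder.
    pub struct ToyChecksum;

    impl SignatureVerifier for ToyChecksum {
        fn is_valid(&self, model: &[u8], signature: &[u8]) -> bool {
            let sum = model.iter().fold(0u8, |acc, b| acc.wrapping_add(*b));
            signature.len() == 1 && signature[0] == sum
        }
    }

    #[derive(Debug, PartialEq)]
    pub struct LoadDenied;

    /// The field is private, so code outside this module cannot build a permit
    /// without going through `verify`.
    #[derive(Debug)]
    pub struct LoadPermit {
        model_len: usize,
    }

    /// The only way to obtain a `LoadPermit`.
    pub fn verify(
        verifier: &dyn SignatureVerifier,
        model: &[u8],
        signature: &[u8],
    ) -> Result<LoadPermit, LoadDenied> {
        if verifier.is_valid(model, signature) {
            Ok(LoadPermit { model_len: model.len() })
        } else {
            Err(LoadDenied)
        }
    }

    #[derive(Debug)]
    pub struct Session {
        model_len: usize,
    }

    impl Session {
        /// Taking the permit by value means a session cannot exist without one.
        pub fn new(permit: LoadPermit) -> Self {
            Session { model_len: permit.model_len }
        }

        pub fn model_len(&self) -> usize {
            self.model_len
        }
    }
}
```

With this shape, skipping verification is a compile error rather than a runtime check someone might forget: there is no value of type `LoadPermit` to pass to `Session::new` unless `verify` returned `Ok`.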

Quick start

Prerequisite: Rust 1.96 or newer, matching the workspace rust-version.

cargo build --workspace
cargo test --workspace

Build just the dependency-light core, or cross-compile to WASM

Build and test only the pure-Rust local core (no Candle, no network):

cargo test -p el-core -p el-memory -p el-telemetry -p el-provenance -p el-safety -p el-runtime -p el-grammar

Cross-compile that core for WASM:

rustup target add wasm32-wasip1 wasm32-unknown-unknown

cargo build --target wasm32-wasip1 -p el-core -p el-memory -p el-telemetry -p el-provenance -p el-safety -p el-runtime -p el-grammar

Cross-compile the device bindings (el-ffi) for Android / iOS / Web via the Makefile: make build-android, make build-ios, make build-wasm, or make bindings for all three codegen surfaces.

Local chat test client

apps/el-chat is an interactive REPL that holds a real multi-turn conversation with a small LLM running entirely on-device. It exists to exercise the SDK end-to-end, so its only direct dependencies are SDK crates (el-core, el-engine-candle) — it contains no inference, model, or tokenizer code of its own.

Fetch a model and run it

Every reply flows through the ADR-010 LlmProvider seam:

el-chat  →  el_core::LlmProvider  →  el_engine_candle::QwenChatProvider
                                       (real Qwen2 forward via candle-transformers)
                                  →  el_runtime::InferenceSession
                                       (provenance gate → prefill → decode loop)

Decoding is the runtime's deterministic greedy argmax, so replies are reproducible. The model is supplied as a local file — there is no runtime network egress (ADR-004 air-gap by default). Fetch a small instruct model once:

mkdir -p models
curl -sSL -o models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
curl -sSL -o models/qwen2.5-0.5b-instruct.tokenizer.json \
  https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/tokenizer.json

Then chat (the defaults point at the files above):

cargo run -p el-chat                                  # interactive REPL
cargo run -p el-chat -- --prompt "Hello!" --once      # one-shot
cargo run -p el-chat -- --system "Be terse." --max-tokens 128

REPL commands: /reset, /system <text>, /help, /exit. Other flags: --model, --tokenizer, --system, --max-tokens. The models/ directory is git-ignored. See apps/el-chat/README.md for the full user guide.

Workspace map

Twelve crates plus two apps. Each crate has its own README (linked below) covering its public API, a usage example, and the ADRs it realizes; app rows identify the runnable clients.

The full workspace table

Crate	Role	Current state
`crates/el-core`	Shared types, IDs, errors, events, `LlmProvider` trait	Implemented and tested
`crates/el-memory`	Static arena planning and KV-cache descriptors	Implemented and tested
`crates/el-telemetry`	Content-free event handling and privacy metrics	Implemented and tested
`crates/el-provenance`	Verified model load permits	Implemented and tested
`crates/el-safety`	Tiered decoder-time safety policy	Partial, lightweight path implemented
`crates/el-runtime`	Session lifecycle and decode-loop orchestration	Implemented and tested
`crates/el-grammar`	DFA grammar masking	Implemented and tested
`crates/adapters/el-provenance-ed25519`	Real ED25519 signature verification	Implemented and tested
`crates/adapters/el-engine-candle`	Candle inference adapter: engine-seam proof plus a real Qwen2 transformer engine and chat provider	Implemented; real on-device chat
`crates/adapters/el-cloud`	Opt-in OpenAI-compatible provider backend	Implemented; egress opt-in at construction
`crates/adapters/el-ffi`	Device SDK facade (`EdgeLlm`): Flutter / UniFFI / wasm-bindgen binding surfaces	Implemented and tested; host build is a workspace member, cross-target builds via `make`
`crates/adapters/el-grammar-llguidance`	llguidance JSON-schema token masking	Implemented and tested; workspace-excluded (crates.io deps)
`apps/el-chat`	Interactive chat test client; SDK-only deps, drives the runtime end-to-end	Implemented; runs real on-device chat
`apps/el-bench`	Benchmark harness; SDK-only deps, replays quality/safety task sets through the runtime	Implemented; model-agnostic, reproducible

Of the adapters, only el-grammar-llguidance is excluded from the default workspace build (it pulls crates.io-only grammar dependencies); el-cloud and el-ffi are regular members whose host targets build and test with cargo test --workspace.

Benchmarks

The SDK ships two reproducible benchmark harnesses. Both run inference through the public LlmProvider seam, so they characterize the SDK's own behavior and are model-agnostic: point them at whichever signed model your product loads. The dated reports under docs/benchmarks/ publish reference results for the prototype's reference model (Qwen2.5-0.5B-Instruct q4_k_m, Intel i5-14500, release build) so the trend is visible. Re-run either harness against your own model to reproduce the measurements.

1. Runtime performance — measured progression

This benchmark gives a phase-level breakdown of the local decode path, collected with opt-in instrumentation (EL_BENCH=1) that costs nothing when unset. The SDK-side conclusion is stable across every run: the runtime's own per-token glue (decode loop, grammar mask, full-vocab argmax, KV commit, content-free event emission) accounts for less than about 1.2% of decode time, while roughly 99% is the raw model forward pass. The wins have therefore come from removing orchestration costs, tracked across three reports:

Milestone	Report	Per-turn weight reload	Prefill (full suite)	Multi-turn re-prefill	Full-suite wall¹
Baseline (pre-ADR-018)	2026-06-14	~1.2 s every turn	un-batched, ~15 tok/s, the #1–2 cost	whole conversation re-prefilled each turn	— (micro-bench)
Resident model + stateful sessions (ADR-018)	2026-06-22	0 ms (loaded once)	56.8% of compute (now the #1 cost)	still re-prefilled	53.7 min
+ Cross-turn prefix reuse (ADR-018 AC-3)	2026-06-23	0 ms	29.5% of compute (−65% wall)	only the new suffix is prefilled	35.7 min (−33.5%)

¹ Same el-bench suite (66 tasks → 97 replies, 256 max-tokens, safety off), same host — the 06-22 and 06-23 rows are directly comparable. The 06-14 baseline is an el-chat micro-benchmark that isolated the bottlenecks, so its wall time is not run-for-run comparable; the per-phase rates and bottleneck status are.

Net effect so far: the per-turn 491 MB weight-reload tax is gone, and multi-turn prefill no longer re-processes the whole conversation — cutting the full-suite run by a third and making multi-turn (mindeval) ~3× faster (52.2 → 17.5 s/reply). The remaining levers are kernel-level: batching the prefill of a cold/first turn (ADR-020 — the per-forward prefill rate is still ~15 tok/s) and the ~14 tok/s quantized decode floor, which is now the dominant cost.
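
The headline ratios can be re-derived from the figures quoted above. The two helper names below are illustrative and are not part of el-bench.

```rust
/// Signed percent change from `before` to `after`; negative means a reduction.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// How many times faster `after` is than `before` (both are durations).
pub fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

`percent_change(53.7, 35.7)` is about -33.5, matching the full-suite wall-time reduction in the table, and `speedup(52.2, 17.5)` is about 2.98, which is the "~3×" quoted for multi-turn replies.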

2. Clinical-quality & safety evaluation

apps/el-bench · benchmarks/README.md · 2026-06-15 report

el-bench replays published mental-health benchmarks (CounselBench, MindEval, and the VERA-MH suicide-risk safety suite; 66 tasks → 97 replies) through the runtime and records transcripts for scoring against each benchmark's rubric. Datasets and transcripts are fetched or produced locally and are git-ignored (third-party data); only the harness and the methodology are committed. Decoding is deterministic, so a given model and task set yield identical transcripts. The harness is therefore designed to run as a CI safety gate whenever the model, system prompt, or ADR-005 safety tier changes.

Headline (reference model): used as an unsupervised responder, the raw 0.5B model is categorically unsafe for crisis scenarios — 0/15 risk personas got a specific crisis resource and 0/17 got a direct safety question. This is the hard requirement behind ADR-005 decoder-time safety and a dedicated crisis-routing layer; see the report for the full rubric-by-rubric findings.

Architecture decisions

The project is intentionally decision-heavy because mobile LLM runtimes are easy to overfit to one device, model, or provider. The core choices are recorded as ADRs.

The ten architecture decision records

ADR	Decision
ADR-001	Native ARM plus WASM as first-class runtime targets
ADR-002	Candle as the Rust-native inference engine
ADR-003	Static memory planning with a zero-allocation arena
ADR-004	Air-gapped by default
ADR-005	On-device decoder-time safety
ADR-006	Mandatory ED25519 model-signature verification
ADR-007	Content-free domain events for privacy-preserving telemetry
ADR-008	Rust instead of C/C++
ADR-009	Flutter Rust Bridge for Dart bindings
ADR-010	Unified local/cloud `LlmProvider` trait

See the full index in docs/adr/README.md.
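
To illustrate what ADR-003's static memory planning buys, here is a minimal planner sketch: tensors whose lifetimes never overlap may share the same bytes, so the arena can be smaller than the sum of all tensor sizes. The greedy "largest first, first fit" heuristic and all type names are assumptions chosen for brevity; el-memory may plan differently.

```rust
#[derive(Debug, Clone, Copy)]
pub struct TensorReq {
    pub size: usize,
    pub first_use: usize, // step that produces the tensor
    pub last_use: usize,  // last step that reads it (inclusive)
}

#[derive(Debug, PartialEq)]
pub struct Plan {
    pub offsets: Vec<usize>, // offsets[i] is the arena offset of reqs[i]
    pub arena_size: usize,
}

fn lifetimes_overlap(a: &TensorReq, b: &TensorReq) -> bool {
    a.first_use <= b.last_use && b.first_use <= a.last_use
}

pub fn plan_arena(reqs: &[TensorReq]) -> Plan {
    // Place larger tensors first; the stable sort keeps input order on ties,
    // so the plan is deterministic.
    let mut order: Vec<usize> = (0..reqs.len()).collect();
    order.sort_by(|&a, &b| reqs[b].size.cmp(&reqs[a].size));

    let mut offsets = vec![0usize; reqs.len()];
    let mut placed: Vec<usize> = Vec::new();
    let mut arena_size = 0;

    for &i in &order {
        // Byte ranges held by already-placed tensors that are alive at the same time.
        let mut busy: Vec<(usize, usize)> = placed
            .iter()
            .filter(|&&j| lifetimes_overlap(&reqs[i], &reqs[j]))
            .map(|&j| (offsets[j], offsets[j] + reqs[j].size))
            .collect();
        busy.sort();

        // First fit: take the first gap that is large enough, else go past the end.
        let mut offset = 0;
        for (start, end) in busy {
            if offset + reqs[i].size <= start {
                break;
            }
            offset = offset.max(end);
        }

        offsets[i] = offset;
        arena_size = arena_size.max(offset + reqs[i].size);
        placed.push(i);
    }

    Plan { offsets, arena_size }
}
```

For three tensors of sizes 100, 50, and 100 with lifetimes 0..=1, 1..=2, and 2..=3, the first and third never coexist and both land at offset 0, the second is placed at offset 100, and the arena needs 150 bytes instead of 250. All offsets are known before the decode loop starts, which is what makes an allocation-free hot path possible.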

Domain model

The DDD model lives in docs/ddd. The key invariant across its contexts: air-gap is the default runtime shape, not a feature flag sprinkled through the code — any outbound behavior must be modeled as an explicit port or adapter.

The nine bounded contexts

Inference Runtime
Prompt Compression
Speculative Decoding
Grammar Constraint
Safety
Memory Management
Hardware and Delegate
Model Provenance and Security
Telemetry and Privacy

Roadmap

The prototype has proven the architectural seams. The next engineering work is to replace toy proofs with production-grade runtime pieces.

What's next

Production GGUF/safetensors loading and transformer execution in el-engine-candle.
Lark/CFG grammars in the llguidance adapter (JSON-schema masking is done).
SecDecoding/CSD-style model-backed safety with runtime backtracking.
Binding codegen and packaging — FRB Dart codegen, uniffi-bindgen-react-native, wasm-pack npm publishing (the Rust binding surfaces exist in el-ffi).
Mobile toolchain validation for Android and iOS aarch64 targets.
On-device benchmarks for memory high-water marks and thermal behavior, and wiring the el-bench VERA-MH safety suite into CI as a release gate (latency and quality/safety harnesses already exist — see Benchmarks).

Documentation

Product and technical rationale: docs/prd.md
Domain model: docs/ddd/README.md
Architecture decisions: docs/adr/README.md
Per-crate guides: see the Workspace map — every crate links to its own README.

Edge Intelligence is still early, but the direction is deliberate: a small, auditable, Rust-native SDK that lets app developers choose local inference first and add remote intelligence only when the product explicitly calls for it.
