feat(firewall): payload sizing, byte budgets, tiktoken counter, honest summaries by dgenio · Pull Request #237 · dgenio/agent-kernel

dgenio · 2026-06-22T07:48:10Z

Closes #207, closes #211, closes #218, closes #174.

This is the grouped firewall PR identified during triage: four issues that share one implementation path — how the context firewall measures, bounds, and faithfully represents tool-output payloads before they reach the LLM. They live in firewall/ + handles.py and are cleaner together (a single shared size estimator) than apart.

What changed

firewall/size_estimate.py (new) — Avoid serializing entire payloads to measure size in the firewall hot path #207. Allocation-free estimated_size(value, *, limit=None) that walks a structure to approximate its serialised length instead of json.dumps-ing it. transform.py now uses it for the raw-mode budget check (and drops import json), so a multi-MB raw payload is no longer fully serialised purely to measure it. Deterministic; exact for scalar containers, bounded error otherwise (only threshold comparisons depend on it).
handles.py — Add byte-size-aware budgeting and eviction to HandleStore #211. HandleStore gains opt-in max_total_bytes (evict oldest-first, never the just-stored entry) and max_entry_bytes (reject an over-cap payload with the new HandleTooLarge error rather than truncating it — keeps expansion faithful). Both default to None → behaviour unchanged. current_bytes exposes resident usage; eviction/expiry paths maintain the accounting.
handle_query.py (new) — refactor. The HandleStore.expand query pipeline (validation → filter → paginate → project → redact → Frame) moves verbatim into a free expand_handle(); the store keeps a thin delegator that injects its TTL-checked fetch via a lambda. This was required to stay within the shrink-only module-size ratchet after adding the byte-budget code, and it brings handles.py from 371 → 264 lines (also chips at Bring oversized modules back within the documented 300-line budget #188). handles.py is removed from _SIZE_RATCHET per the documented "shrank below 300 → drop it" rule.
firewall/token_counting_tiktoken.py (new) — Implement the reserved tiktoken extra for accurate token counting #218. make_tiktoken_counter(encoding="cl100k_base") implements the existing TokenCounter seam with a real tokenizer: lazy import tiktoken (base install stays light), cached encoder, eager resolution so a missing extra (ImportError) / unknown encoding (FirewallError) fails at construction. Exported from weaver_kernel.firewall; wired via BudgetManager(token_counter=...).
firewall/summarize.py — Surface truncation explicitly in summaries and fix boolean handling in summarize() #174. Boolean columns are reported as True/False counts instead of being averaged (a bool is an int subclass — "mean of is_active = 0.7" is nonsense); truncated fact lists end in an explicit … (N more facts omitted; full data via handle) marker, never silently.
Docs/metadata: docs/context_firewall.md (budget estimation, byte budgets, tiktoken usage, summary semantics), CHANGELOG.md, HandleTooLarge exported, tiktoken.* mypy override.

Why

Grouped per the triage: same module, same bug category (size/budget accounting), one shared estimator. Decisions and assumptions (Mode B):

tiktoken default encoding cl100k_base (broadest compatibility; o200k_base documented for GPT-4o/o-series) — the encoding is explicit, not guessed per model.
Oversized handle entry → reject (not silent truncate), because a truncated raw dataset would corrupt expansion and the goal is bounding resident raw data.
The issue's reference to Firewall(token_counter=…) is imprecise — the real seam is BudgetManager(token_counter=…); wired there.
tiktoken is not added to [dev]: it downloads BPE vocabularies over the network on first use, which must not be a test dependency. The counter's logic is unit-tested against a fake encoder injected into sys.modules, fully offline.

How verified

make ci → EXIT 0: ruff format --check (117 files), ruff check (all passed), mypy --strict, pytest 762 passed, 1 skipped (branch-coverage fail_under gate met), all 13 examples run.
New tests: tests/test_size_estimate.py (scalars, exactness on scalar containers, ±25% bounded error corpus, early-exit, non-serialisable fallback, deep-nesting no-recursion), tests/test_token_counting_tiktoken.py (fake-encoder: value handling, unknown-encoding FirewallError, missing-extra ImportError, BudgetManager wiring). Extended tests/test_firewall.py (bool-not-averaged, numeric still averaged, truncation marker present/absent, exactly-max_facts no false marker) and tests/test_handles.py (per-entry reject, total-bytes eviction order, evicted-handle expand → HandleNotFound, accounting release, invalid-config). Added size_estimate to the doctest gate.
Behaviour-preserving: the existing test_handles.py expand suite (now exercising the delegator → handle_query) passes unchanged.

Risks / caveats

estimated_size is approximate; it preserves the only size-driven decision (raw-mode warning) and is bounded by tests, but it is not byte-identical to json.dumps.
max_entry_bytes rejection raises from store(); only callers that opt in are affected (kernel default is None, so the invoke path is untouched). Kernel-level wiring of byte budgets is intentionally out of scope (HandleStore-only, per Add byte-size-aware budgeting and eviction to HandleStore #211).
One PR spans four issues by request; the handle_query move is verbatim but does enlarge the diff.

🤖 Generated with Claude Code

https://claude.ai/code/session_01HjetuNJX1EUJ2eVGfoLyHt

Generated by Claude Code

…t summaries Group the four context-firewall issues that share a single implementation path (payload measurement and the bounded Frame) into one change: - #207: add allocation-free firewall.estimated_size and use it for the raw-mode budget check instead of len(json.dumps(...)), so large payloads are no longer fully serialised just to be measured. - #211: add opt-in byte budgets to HandleStore (max_total_bytes evicts oldest-first; max_entry_bytes rejects an over-cap payload with the new HandleTooLarge error). Defaults stay None so behaviour is unchanged. - #218: ship firewall.make_tiktoken_counter implementing the TokenCounter seam with a real tokenizer (lazy import, cached encoder, configurable encoding), wired via BudgetManager(token_counter=...). - #174: stop averaging boolean columns in summaries (report true/false counts) and surface truncation with an explicit "N more facts omitted" marker. To stay within the shrink-only module-size ratchet, the HandleStore.expand query pipeline moves verbatim into a new handle_query module (also reduces the handles.py ceiling debt tracked by #188). Docs and CHANGELOG updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HjetuNJX1EUJ2eVGfoLyHt

Copilot

Pull request overview

This PR upgrades the context firewall and handle storage to better measure and bound tool-output payloads before they reach the LLM, adding an allocation-free size estimator, optional byte budgeting for HandleStore, a tiktoken-backed token counter, and more faithful summaries.

Changes:

Replace json.dumps-based sizing in the firewall with firewall.estimated_size() and export it as part of the firewall API.
Add optional HandleStore budgeting (max_total_bytes, max_entry_bytes) plus HandleTooLarge, and refactor handle expansion into expand_handle() in a new module.
Add make_tiktoken_counter() (lazy optional dependency) and improve summary fidelity (bool handling + explicit truncation marker), with docs/tests updates.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_token_counting_tiktoken.py	New tests for the `tiktoken` counter factory and BudgetManager wiring using an offline fake module.
tests/test_size_estimate.py	New tests for allocation-free size estimation, bounded error, and early-exit behavior.
tests/test_handles.py	Extends handle-store tests for byte budgeting, eviction, and config validation.
tests/test_firewall.py	Adds tests for boolean summarization and explicit truncation markers.
tests/test_doctests.py	Adds `firewall.size_estimate` to doctest coverage.
tests/test_architecture.py	Removes `handles.py` from the shrink-only size ratchet after refactor.
src/weaver_kernel/handles.py	Adds byte budgeting/accounting and delegates expansion to `handle_query.expand_handle()`.
src/weaver_kernel/handle_query.py	New module containing the handle expansion pipeline (validation → filter → paginate → project → redact → Frame).
src/weaver_kernel/firewall/transform.py	Uses `estimated_size()` instead of `json.dumps()` for raw-mode sizing checks.
src/weaver_kernel/firewall/token_counting_tiktoken.py	New `tiktoken`-backed token counter factory with lazy import and cached encoding.
src/weaver_kernel/firewall/summarize.py	Fixes bool-as-int averaging and adds explicit omission markers for truncated facts.
src/weaver_kernel/firewall/size_estimate.py	New iterative allocation-free payload size estimator with optional early exit.
src/weaver_kernel/firewall/init.py	Exports `estimated_size` and `make_tiktoken_counter`.
src/weaver_kernel/errors.py	Expands `BudgetConfigError` docstring and adds `HandleTooLarge`.
src/weaver_kernel/init.py	Re-exports `HandleTooLarge`.
pyproject.toml	Adds mypy ignore override for `tiktoken.*`.
docs/context_firewall.md	Documents size estimation, handle byte budgets, summary semantics, and `tiktoken` usage.
CHANGELOG.md	Adds an unreleased entry describing the grouped firewall improvements.

 def _summarize_dict(data: dict[str, Any], *, max_facts: int) -> list[str]:
    facts: list[str] = [f"Keys: {', '.join(sorted(data.keys())[:20])}"]
-    for k, v in list(data.items())[: max_facts - 1]:
+    for k, v in data.items():
        if isinstance(v, (int, float)):


 def _summarize_plain_list(data: list[Any], *, max_facts: int) -> list[str]:
    facts = [f"List of {len(data)} items"]
-    for item in data[: max_facts - 1]:
-        facts.append(repr(item)[:100])
-    return facts[:max_facts]
+    facts.extend(repr(item)[:100] for item in data)
+    return _truncate_facts(facts, max_facts)


+        size = estimated_size(data)
+        if self._max_entry_bytes is not None and size > self._max_entry_bytes:
+            raise HandleTooLarge(
+                f"Handle data for '{capability_id}' is ~{size} bytes, exceeding the "
+                f"per-entry cap of {self._max_entry_bytes} bytes; not stored."
+            )
+


+        elif isinstance(cur, dict):
+            n = len(cur)
+            total += _CONTAINER_DELIMS + max(0, n - 1) * _ITEM_SEP
+            for key, val in cur.items():
+                total += len(str(key)) + _QUOTES + _KV_SEP
+                stack.append(val)
+        elif isinstance(cur, (list, tuple)):
+            n = len(cur)
+            total += _CONTAINER_DELIMS + max(0, n - 1) * _ITEM_SEP
+            stack.extend(cur)


+    total = 0
+    stack: list[Any] = [value]
+    while stack:


…stimator, total-budget invariant - summarize: cap _summarize_dict/_summarize_plain_list iteration with islice and derive the omitted count from known length, so a huge dict/list no longer triggers an O(n) walk on the firewall hot path (defeating max_facts). - size_estimate: count each container at most once (cycle detection) so self-referential inputs terminate instead of hanging; correct the docstring's traversal-order claim (the total is order-independent). - handles: reject a single entry larger than the binding cap (max_entry_bytes or max_total_bytes) so current_bytes can never exceed max_total_bytes; use estimated_size(limit=...) to avoid fully walking an over-cap payload. - Docs/CHANGELOG and tests updated (cyclic-input, total-budget rejection). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HjetuNJX1EUJ2eVGfoLyHt

dgenio · 2026-06-22T07:59:03Z

Thanks — all five points were valid and are fixed in 25b26dd:

summarize.py (dict & plain-list O(n) on the hot path): both now iterate at most max_facts items via islice and derive the omitted count from the known length, so a huge payload no longer materialises one fact per item.
handles.py (store could stay over max_total_bytes): a single entry larger than the binding cap (max_entry_bytes or max_total_bytes) is now rejected with HandleTooLarge, so current_bytes can never exceed the budget. The size check uses estimated_size(..., limit=...) so an over-cap payload isn't fully walked. Added test_entry_larger_than_total_budget_is_rejected / test_total_budget_never_exceeded_after_stores.
size_estimate.py (cyclic input): the walk now records each container's id and visits it once, so self-referential structures terminate instead of hanging (covered by new cyclic-input tests).
size_estimate.py (traversal-order docstring): the total is order-independent, so I corrected the docstring rather than reverse the push order (the early-exit value is documented as "only guaranteed > limit").

make ci is green locally (766 passed).

Generated by Claude Code

Copilot AI review requested due to automatic review settings June 22, 2026 07:48

Copilot started reviewing on behalf of dgenio June 22, 2026 07:48 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(firewall): payload sizing, byte budgets, tiktoken counter, honest summaries#237

feat(firewall): payload sizing, byte budgets, tiktoken counter, honest summaries#237
dgenio wants to merge 2 commits into
mainfrom
claude/github-issue-triage-nfi1ep

dgenio commented Jun 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

dgenio commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dgenio commented Jun 22, 2026

What changed

Why

How verified

Risks / caveats

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

dgenio commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants