feat(pulse/telegram): bidirectional image support #1384

Open
klausagnoletti wants to merge 1 commit into danielmiessler:main from klausagnoletti:feat/telegram-bidirectional-images

Pulse Telegram bridge: bidirectional image support

Summary

The Pulse Telegram module (PAI/PULSE/modules/telegram.ts) is text-only today: a
single bot.on("message:text") handler, no media in or out. This PR adds bidirectional
image support, so you can send your DA a photo or PNG/JPEG from your phone and have it
actually look at the image, and so the DA can send images back. It is implemented and
running in production on my instance.

Motivation

The most natural thing to do from a phone is send a screenshot or a photo. Today the
bridge silently drops any non-text message (the message:photo / message:document
update has no handler, so grammY ignores it and the user gets no reply). Outbound, the
DA can generate or fetch an image but has no way to deliver it.

Design

Inbound (photo or PNG/JPEG document)

A new bot.on(["message:photo", "message:document"]) handler downloads the file, then
hands the local path to the SDK session, which uses the Read tool (vision) to view it.
Captions flow through and are sanitised + injection-scanned exactly like text messages.

Because the saved path is handed to a bypassPermissions session, inbound bytes are
validated before they ever reach the decoder:

  • PNG and JPEG only, identified by magic bytes, never the client-declared
    mime_type (which is attacker-controllable).
  • Everything else is rejected: WebP, GIF, SVG, voice, audio, video, stickers, and all
    non-image documents. Refusing to decode those formats is what retires the libwebp
    (CVE-2023-4863) and SVG-script attack classes: a decoder you never invoke can't hurt you.
  • 10 MB byte cap, plus a 40 MP / 10000 px pre-decode dimension cap. Documents are
    not re-encoded by Telegram, so a tiny file can claim a gigapixel canvas (decompression
    bomb); the dimension cap is read cheaply from the PNG IHDR / JPEG SOF header before any
    full decode or allocation.
  • Saved under a dedicated state/telegram/incoming/ dir with a randomUUID() filename
    and the extension derived from the sniffed type (never the remote path). Best-effort
    deleted after the session reads it.
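
A minimal sketch of the validation steps listed above, not the code from the diff. All function and constant names are hypothetical. It assumes "40 MP" is a cap on width × height and "10000 px" a cap on each side, and it reads dimensions only from the PNG IHDR chunk or the first JPEG SOF segment, so nothing is decoded.

```typescript
import { randomUUID } from "node:crypto";
import { join } from "node:path";

// All names are hypothetical; only the limits come from the PR description.
type ImageType = "png" | "jpeg";
type Verdict =
  | { ok: true; type: ImageType; width: number; height: number }
  | { ok: false; reason: string };

const MAX_BYTES = 10 * 1024 * 1024; // 10 MB byte cap
const MAX_PIXELS = 40_000_000; // 40 MP, assumed to mean width * height
const MAX_SIDE = 10_000; // 10000 px, assumed to apply per side
const PNG_SIG = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

// Big-endian readers; callers bounds-check before calling.
const u16 = (b: Uint8Array, o: number): number => b[o] * 0x100 + b[o + 1];
const u32 = (b: Uint8Array, o: number): number => u16(b, o) * 0x10000 + u16(b, o + 2);

// Identify the format from leading bytes only; the declared mime_type is never consulted.
function sniffType(b: Uint8Array): ImageType | null {
  if (b.length >= 8 && PNG_SIG.every((v, i) => b[i] === v)) return "png";
  if (b.length >= 3 && b[0] === 0xff && b[1] === 0xd8 && b[2] === 0xff) return "jpeg";
  return null;
}

// PNG: IHDR must be the first chunk, so width and height sit at offsets 16 and 20.
function pngSize(b: Uint8Array): [number, number] | null {
  const isIhdr =
    b.length >= 24 && b[12] === 0x49 && b[13] === 0x48 && b[14] === 0x44 && b[15] === 0x52;
  return isIhdr ? [u32(b, 16), u32(b, 20)] : null;
}

// JPEG: walk marker segments until a start-of-frame (SOF) segment, which carries the size.
function jpegSize(b: Uint8Array): [number, number] | null {
  let off = 2; // skip SOI (FF D8)
  while (off + 4 <= b.length) {
    if (b[off] !== 0xff) return null;
    const m = b[off + 1];
    if (m === 0xff) { off += 1; continue; } // fill byte
    if (m === 0x01 || (m >= 0xd0 && m <= 0xd8)) { off += 2; continue; } // markers without a length
    if (m === 0xd9 || m === 0xda) return null; // EOI or SOS before any SOF
    const isSof = m >= 0xc0 && m <= 0xcf && m !== 0xc4 && m !== 0xc8 && m !== 0xcc;
    if (isSof) return off + 9 <= b.length ? [u16(b, off + 7), u16(b, off + 5)] : null;
    const len = u16(b, off + 2);
    if (len < 2) return null;
    off += 2 + len;
  }
  return null;
}

function checkImage(b: Uint8Array): Verdict {
  if (b.length > MAX_BYTES) return { ok: false, reason: "file exceeds 10 MB" };
  const type = sniffType(b);
  if (type === null) return { ok: false, reason: "not PNG or JPEG" };
  const size = type === "png" ? pngSize(b) : jpegSize(b);
  if (size === null) return { ok: false, reason: "no readable size header" };
  const [width, height] = size;
  if (width === 0 || height === 0) return { ok: false, reason: "zero dimension" };
  if (width > MAX_SIDE || height > MAX_SIDE || width * height > MAX_PIXELS) {
    return { ok: false, reason: "dimensions exceed cap" };
  }
  return { ok: true, type, width, height };
}

// The filename is a fresh UUID plus the sniffed type, never anything from the remote path.
function incomingPath(dir: string, type: ImageType): string {
  return join(dir, `${randomUUID()}.${type === "png" ? "png" : "jpg"}`);
}
```

Under these assumptions a 33-byte PNG header claiming 60000 × 60000 pixels is rejected by checkImage before any allocation, which is the decompression-bomb case the dimension cap exists for.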

Outbound (DA to user)

The DA includes a tag of the form [[IMG:/absolute/path]] (or [[IMG:https://url]]) anywhere
in its reply. The bridge extracts the refs, strips the tags from the visible text (including
during the live streaming edits), and sends each as a photo or document. The convention is
documented to the model via two lines added to the existing TELEGRAM MODE OVERRIDE
system-prompt block.
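
The tag handling can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the diff: the function names are hypothetical, only absolute paths and https URLs are treated as refs, and hiding a half-arrived tag at the end of a streaming chunk is one guess at how mid-stream stripping might work.

```typescript
// Hypothetical helpers; the tag grammar is inferred from the two forms named in the PR.
const IMG_TAG = /\[\[IMG:((?:\/|https:\/\/)[^\]\s]+)\]\]/g;
const OPEN_TAIL = /\[\[IMG:[^\]\s]*\]?$/; // a tag that has started but not closed yet
const PREFIX_TAIL = /\[(?:\[(?:I(?:M(?:G)?)?)?)?$/; // "[", "[[", "[[I", "[[IM", "[[IMG" at the end

interface ParsedReply {
  text: string; // reply with the tags removed
  refs: string[]; // absolute paths or https URLs, in order of appearance
}

// Pull every complete tag out of a finished reply.
function extractImageRefs(reply: string): ParsedReply {
  const refs: string[] = [];
  const text = reply
    .replace(IMG_TAG, (_match: string, ref: string) => {
      refs.push(ref);
      return "";
    })
    .replace(/[ \t]+\n/g, "\n") // drop trailing spaces left behind by a removed tag
    .replace(/\n{3,}/g, "\n\n") // collapse the gap a tag-only line leaves
    .trim();
  return { text, refs };
}

// For live streaming edits: also hide a tag that is only partly received so far.
function stripForStreaming(partial: string): string {
  return extractImageRefs(partial).text
    .replace(OPEN_TAIL, "")
    .replace(PREFIX_TAIL, "")
    .trimEnd();
}
```

Each entry in refs would then be sent as a photo or document; a tag that is not an absolute path or https URL is left in the text untouched in this sketch.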

Refactor (no behaviour change to text)

To avoid duplicating ~140 lines of SDK/stream/reply logic across the text and image
handlers, the body of the existing message:text handler is extracted verbatim into a
shared processPrompt(ctx, { userLog, newMessage }). Both handlers call it. The text path
is unchanged: history-building, session-resume, billing key-strip, timeout, chunking, and
persistence are all identical.
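
The shape of that refactor, reduced to a sketch: two small builders produce the { userLog, newMessage } pair, and one shared function does the rest. Everything here is hypothetical apart from the processPrompt name and its two input fields. The real ctx is a grammY context, the real body is far longer, the `run` parameter merely stands in for the SDK session call, and the wording of the image prompt is an assumption.

```typescript
// Hypothetical shapes; only `userLog` and `newMessage` are named in the PR.
interface PromptInput {
  userLog: string;
  newMessage: string;
}
interface ReplyCtx {
  reply(text: string): Promise<unknown>;
}
type RunSession = (prompt: string) => Promise<string>;

// Text path: the message is both what gets logged and what the session receives.
function textInput(text: string): PromptInput {
  return { userLog: text, newMessage: text };
}

// Image path: tell the session where the validated file lives so it can Read it.
// The caption is assumed to be sanitised and injection-scanned before it gets here.
function imageInput(savedPath: string, caption?: string): PromptInput {
  const note = `[image saved at ${savedPath}; view it with the Read tool]`;
  return {
    userLog: caption ? `[image] ${caption}` : "[image]",
    newMessage: caption ? `${note}\n${caption}` : note,
  };
}

// Shared tail called by both handlers; `run` is a stand-in for the SDK session.
async function processPrompt(ctx: ReplyCtx, input: PromptInput, run: RunSession): Promise<void> {
  console.log(`[telegram] ${input.userLog}`);
  await ctx.reply(await run(input.newMessage));
}
```

Because both handlers only differ in how they build the input, the text path can stay byte-for-byte the same while the image path reuses it.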

Security boundary

No change to the trust boundary. The allowed_users middleware still gates every update,
including media. The threat model is a compromised owner account or an owner forwarding a
malicious file, not the open internet.
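
For readers unfamiliar with the gate, a middleware of this kind might look like the sketch below. It is not the bridge's actual allowed_users code. The names are hypothetical, allowed users are assumed to be numeric Telegram user IDs, and silently dropping a disallowed update is an assumption. Minimal structural types are used so the sketch does not import grammY.

```typescript
// Structural stand-in for the part of a grammY context this gate needs.
interface GateCtx {
  from?: { id: number };
}

// True only when the update has a sender and that sender is on the allow-list.
function isAllowed(ctx: GateCtx, allowed: ReadonlySet<number>): boolean {
  const id = ctx.from?.id;
  return id !== undefined && allowed.has(id);
}

// Middleware factory: downstream handlers (text, photo, document) run only for allowed users.
function allowOnly(allowed: ReadonlySet<number>) {
  return async (ctx: GateCtx, next: () => Promise<void>): Promise<void> => {
    if (isAllowed(ctx, allowed)) await next();
  };
}
```

Registered ahead of every handler, for example with bot.use(allowOnly(new Set([123456789]))), a gate like this covers media updates without any per-handler check.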

Patch

One file changed (Releases/v5.0.0/.claude/PAI/PULSE/modules/telegram.ts), +214 / -47,
parses clean (bun build). The PNG/JPEG sniff and dimension logic is unit-tested against
real files and a synthetic decompression-bomb header. Tested and running in production on
my instance: photos, captioned photos, PNG/JPEG documents, and outbound generated images
all round-trip.

Residual risk (stated plainly)

Decoding any image still invokes the host image decoder, so libpng/libjpeg memory-safety
exposure remains on attacker-influenced (compromised-owner) input. It is bounded by the
magic-byte allowlist, the 10 MB / 40 MP / 10000 px pre-decode caps, the sandboxed
Read-only consumer, and the single-user access gate. The libwebp and SVG classes are fully
retired by never decoding those formats. Operators handling untrusted forwards should keep
host image libraries patched.

Out of scope (deliberately)

  • Voice / audio / video inbound. Those need a transcription/frame pipeline (Whisper, ffmpeg)
    the bridge doesn't have; a focused follow-up PR is the right home for them.
  • A separate one-line robustness fix I run locally (passing pathToClaudeCodeExecutable to
    the SDK so it works on non-native installs) is NOT in this diff; I can file it separately
    if useful.

Inbound: new message:photo / message:document handler. The file is downloaded,
validated, and its path passed to the SDK session to Read.

Security (single-user gated bridge; files go to a bypassPermissions session):
- Accept PNG/JPEG only, identified by MAGIC BYTES, never the declared mime.
- Reject WebP/GIF/SVG/voice/audio/video/stickers and all other documents, which
  retires the libwebp (CVE-2023-4863) and SVG-script decoder classes.
- 10MB byte cap + 40MP / 10000px pre-decode dimension cap (decompression bombs).
- UUID filename from the sniffed type; sandboxed incoming dir; cleanup after read.

Outbound: the DA emits [[IMG:/abs/path]] or [[IMG:https://url]]; the bridge sends
it (photo or document) and strips the tag from the text, including mid-stream.

Refactor: the message:text handler body is extracted verbatim into a shared
processPrompt() reused by both handlers; the text path is unchanged. allowed_users
gating and caption injection-scanning preserved.