AssemblyAI · alexkroman · Jun 18, 2026 · Jun 18, 2026
diff --git a/README.md b/README.md
@@ -36,7 +36,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 - **🎯 One command for everything**: transcription, real-time streaming, voice agents, LLM prompts, and WER benchmarking — no SDK boilerplate.
 - **🔌 Built for pipelines**: data goes to stdout, errors to stderr, `--json` gives stable machine-readable output, and `-` reads audio from stdin.
 - **🔐 Secure by default**: your API key lives in the OS keyring, never in a dotfile — and run commands have no `--api-key` flag, so keys can't leak into `ps` or shell history.
-- **🛠️ From demo to deployed app**: `assembly init` scaffolds a runnable FastAPI starter, `assembly dev` / `share` / `deploy` run, tunnel, and ship it, and `--show-code` prints the equivalent Python SDK script for any run command (`transcribe` / `stream` / `agent` / `agent-cascade`).
+- **🛠️ From demo to deployed app**: `assembly init` scaffolds a runnable FastAPI starter, `assembly dev` / `share` / `deploy` run, tunnel, and ship it, and `--show-code` prints the equivalent Python SDK script for any run command (`transcribe` / `stream` / `agent` / `live`).
 - **🤖 Agent-ready**: `assembly setup install` wires your coding agent up with the AssemblyAI docs MCP server and skills.
 - **📖 Open source**: MIT licensed.
 
@@ -48,7 +48,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 | `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
 | `assembly dictate` | Signal-driven dictation: records immediately, send SIGTERM for instant text — scriptable from hotkey tools like Hammerspoon (Sync STT API, up to 120 s per utterance) |
 | `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
-| `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
+| `assembly live` | Talk live to a tool-using voice agent, wired client-side from Streaming STT + a deepagents brain on the LLM Gateway + streaming TTS — it can web-search, fetch URLs, and read the docs mid-conversation, like the `agent-cascade` starter (sandbox-only) |
 | `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
 | `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream |
 | `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval. Defaults to voice in a terminal (speak your request, replies read back via streaming TTS in the sandbox); pass `--no-voice` for the keyboard TUI |
@@ -63,7 +63,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
 | `assembly transcripts` / `sessions` | Browse and fetch past transcripts and streaming sessions |
 | `assembly keys` / `balance` / `usage` / `limits` / `audit` | Account self-service via browser login |
 
-Add `--show-code` to `transcribe` / `stream` / `agent` / `agent-cascade` to print the equivalent Python SDK script instead of running — the built-in path from CLI experiment to SDK code.
+Add `--show-code` to `transcribe` / `stream` / `agent` / `live` to print the equivalent Python SDK script instead of running — the built-in path from CLI experiment to SDK code.
 
 ## ✨ Things you can do with it
 
@@ -194,7 +194,7 @@ assembly transcripts list --json --limit 5 \
 assembly agent --voice ivy --system-prompt "you're a helpful interviewer"
 ```
 
-**Graduate to the SDK** — `--show-code` prints the equivalent Python script for any `transcribe`/`stream`/`agent`/`agent-cascade` run instead of executing it:
+**Graduate to the SDK** — `--show-code` prints the equivalent Python script for any `transcribe`/`stream`/`agent`/`live` run instead of executing it:
 
 ```sh
 assembly agent --system-prompt "you're a story generator" --show-code > story.py

diff --git a/REFERENCE.md b/REFERENCE.md
@@ -94,7 +94,7 @@ each carrying a `"type"` field to dispatch on:
 | ------- | ----------- |
 | `assembly stream --json` | `begin`, `turn`, `termination` (with `--from-stdin`, a `source` event precedes each file's events) |
 | `assembly agent --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
-| `assembly agent-cascade --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
+| `assembly live --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
 | `assembly dictate --json` | `utterance` |
 | `assembly llm --follow --json` | `answer` |
 | `assembly transcribe <batch> --json` | `result` (one per source), then `reduce` if `--llm-reduce` is set |

diff --git a/aai_cli/agent_cascade/brain.py b/aai_cli/agent_cascade/brain.py
@@ -0,0 +1,136 @@
+"""Deepagents-powered reply brain for the live voice cascade.
+
+`assembly live` answers each spoken turn with a deepagents graph instead of a single
+LLM completion, so the agent can transparently reach for tools — web search, URL
+fetch, the AssemblyAI docs — mid-conversation, mimicking a live multimodal assistant
+(the "talk to Gemini Live" experience). The graph is built once per session
+(:func:`build_graph`) and invoked statelessly per turn with the running history the
+cascade already keeps (:func:`build_completer`); tools are read-only and auto-approved,
+because a spoken turn can't pause for a keyboard confirmation, and the system prompt
+keeps every reply short and speakable.
+
+The graph is the only network seam: :func:`build_completer` accepts an injected graph,
+so the per-turn orchestration is unit-tested against a fake with no sockets — the same
+seam the rest of the cascade uses for its STT/LLM/TTS legs.
+"""
+
+from __future__ import annotations
+
+from collections.abc import Callable, Sequence
+from typing import TYPE_CHECKING
+
+from aai_cli.agent_cascade.config import CascadeConfig
+from aai_cli.code_agent.agent import CompiledAgent
+
+if TYPE_CHECKING:
+    from langchain_core.tools import BaseTool
+    from openai.types.chat import ChatCompletionMessageParam
+
+# Appended to the user's persona so the model knows it has tools and must keep replies
+# spoken. The cascade's plain-LLM persona (CascadeConfig.system_prompt) says nothing
+# about tools, so without this the agent would never reach for web search.
+_TOOL_GUIDANCE = (
+    "You can use tools to help answer: search the web for current or unfamiliar facts, "
+    "fetch a specific URL, and look up the AssemblyAI documentation. Reach for a tool "
+    "when a question needs fresh or external information; answer directly and instantly "
+    "when you already know. Your reply is read aloud, so keep it short and spoken — no "
+    "markdown, lists, code, or raw URLs."
+)
+
+
+def build_system_prompt(persona: str) -> str:
+    """The live agent's system prompt: the user's persona plus the tool guidance."""
+    return f"{persona}\n\n{_TOOL_GUIDANCE}"
+
+
+def build_live_tools() -> list[BaseTool]:
+    """The live agent's read-only toolset: URL fetch, web search (if keyed), and docs.
+
+    All three are reused from the coding agent's tool modules. Unlike there they are
+    *not* approval-gated — a spoken turn can't wait for a keyboard confirmation, so the
+    live agent only gets read-only tools and runs them automatically. Web search is
+    present only when ``TAVILY_API_KEY`` is set; the docs MCP is best-effort (an empty
+    list when the host is unreachable), so neither blocks a session.
+    """
+    from aai_cli.code_agent.docs_mcp import load_docs_tools
+    from aai_cli.code_agent.fetch_tool import build_fetch_tool
+    from aai_cli.code_agent.web_search import build_web_search_tool
+
+    tools: list[BaseTool] = [build_fetch_tool()]
+    search = build_web_search_tool()
+    if search is not None:
+        tools.append(search)
+    tools.extend(load_docs_tools())
+    return tools
+
+
+def build_graph(
+    api_key: str, config: CascadeConfig, *, tools: Sequence[BaseTool] | None = None
+) -> CompiledAgent:
+    """Compile the deepagents graph for one live session over the gateway model.
+
+    Reuses the coding agent's gateway-bound ``ChatOpenAI`` (so the live agent can only
+    ever reach AssemblyAI), threading the cascade's ``--max-tokens``/``--llm-config``
+    through it. ``tools`` defaults to :func:`build_live_tools`; tests pass an explicit
+    (possibly empty) list to skip the network-touching docs probe.
+    """
+    from deepagents import create_deep_agent
+
+    from aai_cli.code_agent.model import build_model
+
+    model = build_model(
+        api_key, model=config.model, max_tokens=config.max_tokens, extra=config.llm_extra
+    )
+    resolved = build_live_tools() if tools is None else list(tools)
+    return create_deep_agent(
+        model=model, tools=resolved, system_prompt=build_system_prompt(config.system_prompt)
+    )
+
+
+def build_completer(
+    api_key: str, config: CascadeConfig, *, graph: CompiledAgent | None = None
+) -> Callable[[list[ChatCompletionMessageParam]], str]:
+    """A ``complete_reply`` for the cascade engine backed by the deepagents graph.
+
+    The cascade prepends its own ``system`` message to the history each turn; the graph
+    already owns the system prompt, so we drop it before invoking. The graph runs the
+    full tool loop and we return its final spoken text. ``graph`` is injected in tests
+    so the per-turn wiring runs against a fake with no network.
+    """
+    resolved = build_graph(api_key, config) if graph is None else graph
+
+    def complete_reply(messages: list[ChatCompletionMessageParam]) -> str:
+        conversation = [message for message in messages if message.get("role") != "system"]
+        return _reply_text(resolved.invoke({"messages": conversation}))
+
+    return complete_reply
+
+
+def _reply_text(result: dict[str, object]) -> str:
+    """The agent's final spoken reply: the last assistant message that carries text.
+
+    A tool-using turn ends in an ``AIMessage`` whose ``content`` is the spoken answer,
+    but earlier ``AIMessage``\\s in the same turn (the tool-call requests) have empty
+    text — so we scan from the end for the last one with non-empty content.
+    """
+    messages = result.get("messages")
+    if not isinstance(messages, list):
+        return ""
+    for message in reversed(messages):
+        if type(message).__name__ != "AIMessage":
+            continue
+        text = _content_text(getattr(message, "content", "")).strip()
+        if text:
+            return text
+    return ""
+
+
+def _content_text(content: object) -> str:
+    """Coerce a message's content (a string, or a list of content blocks) to plain text."""
+    if isinstance(content, str):
+        return content
+    if isinstance(content, list):
+        return "".join(
+            block.get("text", "") if isinstance(block, dict) else str(block) for block in content
+        )
+    return str(content)
diff --git a/aai_cli/agent_cascade/engine.py b/aai_cli/agent_cascade/engine.py
@@ -18,9 +18,10 @@
 from dataclasses import dataclass, field
 from typing import TYPE_CHECKING, Protocol
 
+from aai_cli.agent_cascade import brain
 from aai_cli.agent_cascade.config import CascadeConfig
 from aai_cli.agent_cascade.text import split_sentences, trim_history
-from aai_cli.core import client, llm
+from aai_cli.core import client
 from aai_cli.core.errors import CLIError
 from aai_cli.tts import session as tts_session
 from aai_cli.tts.session import SpeakConfig
@@ -121,15 +122,9 @@ def real(
         def run_stt(on_turn: Callable[[object], None]) -> None:
             client.stream_audio(api_key, audio, params=stt_params, on_turn=on_turn)
 
-        def complete_reply(messages: list[ChatCompletionMessageParam]) -> str:
-            response = llm.complete(
-                api_key,
-                model=config.model,
-                messages=messages,
-                max_tokens=config.max_tokens,
-                extra=dict(config.llm_extra) or None,
-            )
-            return llm.content_of(response)
+        # The LLM leg is a deepagents graph (web search / URL fetch / docs tools), not a
+        # single completion, so a spoken turn can transparently use tools.
+        complete_reply = brain.build_completer(api_key, config)
 
         def synthesize(text: str) -> bytes:
             spec = SpeakConfig(

diff --git a/aai_cli/code_agent/model.py b/aai_cli/code_agent/model.py
@@ -8,6 +8,7 @@
 
 from __future__ import annotations
 
+from collections.abc import Mapping
 from typing import TYPE_CHECKING
 
 from aai_cli.core import environments
@@ -37,14 +38,26 @@ def _flatten_content(messages: object) -> None:
             )
 
 
-def build_model(api_key: str, *, model: str) -> BaseChatModel:
+def build_model(
+    api_key: str,
+    *,
+    model: str,
+    max_tokens: int | None = None,
+    extra: Mapping[str, object] | None = None,
+) -> BaseChatModel:
     """A ChatOpenAI bound to the active environment's LLM Gateway.
 
     ``use_responses_api=False`` keeps it on the chat-completions endpoint the gateway
     implements (the same one `aai_cli.core.llm` uses), rather than the OpenAI
     Responses API that langchain would otherwise prefer for ``openai:`` models. The
     subclass also flattens content-parts arrays the gateway rejects (see
     :func:`_flatten_content`).
+
+    ``max_tokens`` caps the per-reply length (the live voice agent passes a small cap to
+    keep spoken replies short and fast); ``extra`` passes any additional gateway request
+    fields through as ``extra_body`` (so they reach the request body verbatim, like
+    `aai_cli.core.llm`'s ``extra``). Both default to off so the coding agent's call is
+    unchanged.
     """
     from langchain_openai import ChatOpenAI
     from pydantic import SecretStr
@@ -64,4 +77,6 @@ def _get_request_payload(
         base_url=environments.active().llm_gateway_base,
         api_key=SecretStr(api_key),
         use_responses_api=False,
+        max_tokens=max_tokens,
+        extra_body=dict(extra) if extra else None,
     )
diff --git a/aai_cli/code_gen/agent_cascade.py b/aai_cli/code_gen/agent_cascade.py
@@ -16,9 +16,11 @@
 # which is never formatted — so no brace has to be doubled.
 _HEADER = """\
 # Live voice cascade: Streaming STT -> LLM Gateway -> streaming TTS, wired client-side.
-# This is what `assembly --sandbox agent-cascade` runs: it transcribes your speech,
+# The basic cascade behind `assembly --sandbox live`: it transcribes your speech,
 # sends each finalized turn to the LLM Gateway, and speaks the reply through streaming
 # TTS — the same three primitives the agent-cascade init template wires server-side.
+# (The `live` command adds a tool-using agent on the LLM leg; this snippet is the
+# plain single-completion version to build from.)
 # Requires audio + websockets:  pip install sounddevice websockets openai
 # Tip: use headphones — the mic stays open while the agent speaks, so on speakers it
 # would hear itself and loop.

diff --git a/aai_cli/commands/agent/__init__.py b/aai_cli/commands/agent/__init__.py
@@ -84,7 +84,7 @@ def agent(
         help="Print the equivalent Python SDK code and exit (does not start a session)",
     ),
 ) -> None:
-    """Hold a live two-way voice conversation with a voice agent
+    """Hold a live two-way voice conversation with the Voice Agent API
 
     Use headphones: the mic stays open while the agent speaks, so on
     speakers it would hear itself and loop. Pass an audio file/URL (or

diff --git a/aai_cli/commands/agent_cascade/__init__.py b/aai_cli/commands/agent_cascade/__init__.py
@@ -31,7 +31,7 @@
 SPEC = command_registry.CommandModuleSpec(
     panel=help_panels.TRANSCRIPTION,
     order=45,  # pragma: no mutate -- sparse rank; a +-1 shift is order-equivalent
-    commands=("agent-cascade",),
+    commands=("live",),
 )
 
 
@@ -43,28 +43,28 @@ def _emit_voice_list(_state: AppState, json_mode: bool) -> None:
 
 
 @app.command(
-    name="agent-cascade",
+    name="live",
     rich_help_panel=help_panels.TRANSCRIPTION,
     epilog=examples_epilog(
         [
-            ("Start a live cascade conversation", "assembly --sandbox agent-cascade"),
+            ("Start a live voice conversation", "assembly --sandbox live"),
             (
                 "Pick a voice and opening line",
-                'assembly --sandbox agent-cascade --voice michael --greeting "Hi there"',
+                'assembly --sandbox live --voice michael --greeting "Hi there"',
             ),
             (
                 "Give the agent a persona",
-                'assembly --sandbox agent-cascade --system-prompt "You are a terse pirate."',
+                'assembly --sandbox live --system-prompt "You are a terse pirate."',
             ),
-            ("See available voices", "assembly --sandbox agent-cascade --list-voices"),
+            ("See available voices", "assembly --sandbox live --list-voices"),
             (
                 "Print equivalent Python instead of running",
-                "assembly --sandbox agent-cascade --show-code",
+                "assembly --sandbox live --show-code",
             ),
         ]
     ),
 )
-def agent_cascade(
+def live(
     ctx: typer.Context,
     source: str | None = typer.Argument(
         None, help="Audio file path or URL to speak to the agent. Omit to use the microphone."
@@ -169,14 +169,15 @@ def agent_cascade(
         help="Print the equivalent Python SDK code and exit (does not start a session)",
     ),
 ) -> None:
-    """\\[sandbox] Hold a live voice conversation through a self-wired cascade
+    """\\[sandbox] Talk live to a tool-using voice agent
 
-    Like 'assembly agent', but instead of AssemblyAI's Voice Agent endpoint this
-    wires the three primitives together itself — Streaming STT, the LLM Gateway,
-    and streaming TTS — exactly like the 'agent-cascade' init template does
-    server-side. Because it uses streaming TTS it only runs in the sandbox: run
-    it as 'assembly --sandbox agent-cascade' (--sandbox goes before the
-    subcommand).
+    A real-time spoken conversation, wired client-side from three primitives —
+    Streaming STT, a deepagents brain on the LLM Gateway, and streaming TTS. Unlike
+    'assembly agent' (the Voice Agent API), the brain here is an agent that can use
+    tools mid-conversation — web search, URL fetch, and the AssemblyAI docs — so it
+    answers like a live multimodal assistant. Because it uses streaming TTS it only
+    runs in the sandbox: run it as 'assembly --sandbox live' (--sandbox goes before
+    the subcommand).
 
     Use headphones: the mic stays open while the agent speaks, so on speakers it
     would hear itself and loop. Pass an audio file/URL (or --sample) to speak a
@@ -185,6 +186,9 @@ def agent_cascade(
 
     This only runs a conversation in the terminal — it writes no code. To build
     an agent-cascade app, run 'assembly init agent-cascade' instead.
+
+    Web search needs a TAVILY_API_KEY in the environment; without it the agent
+    keeps its URL-fetch and docs tools.
     """
 
     if list_voices:

diff --git a/aai_cli/commands/agent_cascade/_exec.py b/aai_cli/commands/agent_cascade/_exec.py
@@ -169,9 +169,9 @@ def _print_show_code(opts: AgentCascadeOptions, system_prompt_text: str) -> None
 def run_agent_cascade(opts: AgentCascadeOptions, state: AppState, *, json_mode: bool) -> None:
     """Execute one `assembly agent-cascade` cascade from already-parsed flags."""
     text_mode, json_mode = resolve_output_modes(opts.output_field, json_mode=json_mode)
-    validate_voice(opts.voice, voices.VOICE_NAMES, command="agent-cascade")
+    validate_voice(opts.voice, voices.VOICE_NAMES, command="live")
     # Streaming TTS has no production host, so the whole cascade is sandbox-only.
-    tts_session.require_available("agent-cascade")
+    tts_session.require_available("live")
     system_prompt_text = _resolve_system_prompt(opts.system_prompt, opts.system_prompt_file)
 
     if opts.show_code: