Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
- **🎯 One command for everything**: transcription, real-time streaming, voice agents, LLM prompts, and WER benchmarking — no SDK boilerplate.
- **🔌 Built for pipelines**: data goes to stdout, errors to stderr, `--json` gives stable machine-readable output, and `-` reads audio from stdin.
- **🔐 Secure by default**: your API key lives in the OS keyring, never in a dotfile — and run commands have no `--api-key` flag, so keys can't leak into `ps` or shell history.
- **🛠️ From demo to deployed app**: `assembly init` scaffolds a runnable FastAPI starter, `assembly dev` / `share` / `deploy` run, tunnel, and ship it, and `--show-code` prints the equivalent Python SDK script for any run command (`transcribe` / `stream` / `agent` / `agent-cascade`).
- **🛠️ From demo to deployed app**: `assembly init` scaffolds a runnable FastAPI starter, `assembly dev` / `share` / `deploy` run, tunnel, and ship it, and `--show-code` prints the equivalent Python SDK script for any run command (`transcribe` / `stream` / `agent` / `live`).
- **🤖 Agent-ready**: `assembly setup install` wires your coding agent up with the AssemblyAI docs MCP server and skills.
- **📖 Open source**: MIT licensed.

Expand All @@ -48,7 +48,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
| `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
| `assembly dictate` | Signal-driven dictation: records immediately, send SIGTERM for instant text — scriptable from hotkey tools like Hammerspoon (Sync STT API, up to 120 s per utterance) |
| `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
| `assembly agent-cascade` | Same live conversation, but wired client-side from Streaming STT + the LLM Gateway + streaming TTS, like the `agent-cascade` starter (sandbox-only) |
| `assembly live` | Talk live to a tool-using voice agent, wired client-side from Streaming STT + a deepagents brain on the LLM Gateway + streaming TTS — it can web-search, fetch URLs, and read the docs mid-conversation, like the `agent-cascade` starter (sandbox-only) |
| `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
| `assembly llm` | Prompt the LLM Gateway over a transcript, files, stdin, or a live stream |
| `assembly code` | Terminal coding agent (deepagents SDK) backed only by the LLM Gateway — reads/writes/edits files, runs shell, searches the docs MCP, and can invoke the `assembly` CLI itself; mutating actions ask for approval. Defaults to voice in a terminal (speak your request, replies read back via streaming TTS in the sandbox); pass `--no-voice` for the keyboard TUI |
Expand All @@ -63,7 +63,7 @@ That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-ins
| `assembly transcripts` / `sessions` | Browse and fetch past transcripts and streaming sessions |
| `assembly keys` / `balance` / `usage` / `limits` / `audit` | Account self-service via browser login |

Add `--show-code` to `transcribe` / `stream` / `agent` / `agent-cascade` to print the equivalent Python SDK script instead of running — the built-in path from CLI experiment to SDK code.
Add `--show-code` to `transcribe` / `stream` / `agent` / `live` to print the equivalent Python SDK script instead of running — the built-in path from CLI experiment to SDK code.

## ✨ Things you can do with it

Expand Down Expand Up @@ -194,7 +194,7 @@ assembly transcripts list --json --limit 5 \
assembly agent --voice ivy --system-prompt "you're a helpful interviewer"
```

**Graduate to the SDK** — `--show-code` prints the equivalent Python script for any `transcribe`/`stream`/`agent`/`agent-cascade` run instead of executing it:
**Graduate to the SDK** — `--show-code` prints the equivalent Python script for any `transcribe`/`stream`/`agent`/`live` run instead of executing it:

```sh
assembly agent --system-prompt "you're a story generator" --show-code > story.py
Expand Down
2 changes: 1 addition & 1 deletion REFERENCE.md
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ each carrying a `"type"` field to dispatch on:
| ------- | ----------- |
| `assembly stream --json` | `begin`, `turn`, `termination` (with `--from-stdin`, a `source` event precedes each file's events) |
| `assembly agent --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
| `assembly agent-cascade --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
| `assembly live --json` | `session.ready`, `transcript.user.delta`, `transcript.user`, `reply.started`, `transcript.agent`, `reply.done` |
| `assembly dictate --json` | `utterance` |
| `assembly llm --follow --json` | `answer` |
| `assembly transcribe <batch> --json` | `result` (one per source), then `reduce` if `--llm-reduce` is set |
Expand Down
136 changes: 136 additions & 0 deletions aai_cli/agent_cascade/brain.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
"""Deepagents-powered reply brain for the live voice cascade.

`assembly live` answers each spoken turn with a deepagents graph instead of a single
LLM completion, so the agent can transparently reach for tools — web search, URL
fetch, the AssemblyAI docs — mid-conversation, mimicking a live multimodal assistant
(the "talk to Gemini Live" experience). The graph is built once per session
(:func:`build_graph`) and invoked statelessly per turn with the running history the
cascade already keeps (:func:`build_completer`); tools are read-only and auto-approved,
because a spoken turn can't pause for a keyboard confirmation, and the system prompt
keeps every reply short and speakable.

The graph is the only network seam: :func:`build_completer` accepts an injected graph,
so the per-turn orchestration is unit-tested against a fake with no sockets — the same
seam the rest of the cascade uses for its STT/LLM/TTS legs.
"""

from __future__ import annotations

from collections.abc import Callable, Sequence
from typing import TYPE_CHECKING

from aai_cli.agent_cascade.config import CascadeConfig
from aai_cli.code_agent.agent import CompiledAgent

if TYPE_CHECKING:
from langchain_core.tools import BaseTool
from openai.types.chat import ChatCompletionMessageParam

# Appended to the user's persona so the model knows it has tools and must keep replies
# spoken. The cascade's plain-LLM persona (CascadeConfig.system_prompt) says nothing
# about tools, so without this the agent would never reach for web search.
_TOOL_GUIDANCE = (
"You can use tools to help answer: search the web for current or unfamiliar facts, "
"fetch a specific URL, and look up the AssemblyAI documentation. Reach for a tool "
"when a question needs fresh or external information; answer directly and instantly "
"when you already know. Your reply is read aloud, so keep it short and spoken — no "
"markdown, lists, code, or raw URLs."
)


def build_system_prompt(persona: str) -> str:
"""The live agent's system prompt: the user's persona plus the tool guidance."""
return f"{persona}\n\n{_TOOL_GUIDANCE}"


def build_live_tools() -> list[BaseTool]:
"""The live agent's read-only toolset: URL fetch, web search (if keyed), and docs.

All three are reused from the coding agent's tool modules. Unlike there they are
*not* approval-gated — a spoken turn can't wait for a keyboard confirmation, so the
live agent only gets read-only tools and runs them automatically. Web search is
present only when ``TAVILY_API_KEY`` is set; the docs MCP is best-effort (an empty
list when the host is unreachable), so neither blocks a session.
"""
from aai_cli.code_agent.docs_mcp import load_docs_tools
from aai_cli.code_agent.fetch_tool import build_fetch_tool
from aai_cli.code_agent.web_search import build_web_search_tool

tools: list[BaseTool] = [build_fetch_tool()]
search = build_web_search_tool()
if search is not None:
tools.append(search)
tools.extend(load_docs_tools())
return tools


def build_graph(
api_key: str, config: CascadeConfig, *, tools: Sequence[BaseTool] | None = None
) -> CompiledAgent:
"""Compile the deepagents graph for one live session over the gateway model.

Reuses the coding agent's gateway-bound ``ChatOpenAI`` (so the live agent can only
ever reach AssemblyAI), threading the cascade's ``--max-tokens``/``--llm-config``
through it. ``tools`` defaults to :func:`build_live_tools`; tests pass an explicit
(possibly empty) list to skip the network-touching docs probe.
"""
from deepagents import create_deep_agent

from aai_cli.code_agent.model import build_model

model = build_model(
api_key, model=config.model, max_tokens=config.max_tokens, extra=config.llm_extra
)
resolved = build_live_tools() if tools is None else list(tools)
return create_deep_agent(
model=model, tools=resolved, system_prompt=build_system_prompt(config.system_prompt)
)


def build_completer(
api_key: str, config: CascadeConfig, *, graph: CompiledAgent | None = None
) -> Callable[[list[ChatCompletionMessageParam]], str]:
"""A ``complete_reply`` for the cascade engine backed by the deepagents graph.

The cascade prepends its own ``system`` message to the history each turn; the graph
already owns the system prompt, so we drop it before invoking. The graph runs the
full tool loop and we return its final spoken text. ``graph`` is injected in tests
so the per-turn wiring runs against a fake with no network.
"""
resolved = build_graph(api_key, config) if graph is None else graph

def complete_reply(messages: list[ChatCompletionMessageParam]) -> str:
conversation = [message for message in messages if message.get("role") != "system"]
return _reply_text(resolved.invoke({"messages": conversation}))

return complete_reply


def _reply_text(result: dict[str, object]) -> str:
"""The agent's final spoken reply: the last assistant message that carries text.

A tool-using turn ends in an ``AIMessage`` whose ``content`` is the spoken answer,
but earlier ``AIMessage``\\s in the same turn (the tool-call requests) have empty
text — so we scan from the end for the last one with non-empty content.
"""
messages = result.get("messages")
if not isinstance(messages, list):
return ""
for message in reversed(messages):
if type(message).__name__ != "AIMessage":
continue
text = _content_text(getattr(message, "content", "")).strip()
if text:
return text
return ""


def _content_text(content: object) -> str:
"""Coerce a message's content (a string, or a list of content blocks) to plain text."""
if isinstance(content, str):
return content
if isinstance(content, list):
return "".join(
block.get("text", "") if isinstance(block, dict) else str(block) for block in content
)
return str(content)
15 changes: 5 additions & 10 deletions aai_cli/agent_cascade/engine.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,10 @@
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Protocol

from aai_cli.agent_cascade import brain
from aai_cli.agent_cascade.config import CascadeConfig
from aai_cli.agent_cascade.text import split_sentences, trim_history
from aai_cli.core import client, llm
from aai_cli.core import client
from aai_cli.core.errors import CLIError
from aai_cli.tts import session as tts_session
from aai_cli.tts.session import SpeakConfig
Expand Down Expand Up @@ -121,15 +122,9 @@ def real(
def run_stt(on_turn: Callable[[object], None]) -> None:
client.stream_audio(api_key, audio, params=stt_params, on_turn=on_turn)

def complete_reply(messages: list[ChatCompletionMessageParam]) -> str:
response = llm.complete(
api_key,
model=config.model,
messages=messages,
max_tokens=config.max_tokens,
extra=dict(config.llm_extra) or None,
)
return llm.content_of(response)
# The LLM leg is a deepagents graph (web search / URL fetch / docs tools), not a
# single completion, so a spoken turn can transparently use tools.
complete_reply = brain.build_completer(api_key, config)

def synthesize(text: str) -> bytes:
spec = SpeakConfig(
Expand Down
17 changes: 16 additions & 1 deletion aai_cli/code_agent/model.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@

from __future__ import annotations

from collections.abc import Mapping
from typing import TYPE_CHECKING

from aai_cli.core import environments
Expand Down Expand Up @@ -37,14 +38,26 @@ def _flatten_content(messages: object) -> None:
)


def build_model(api_key: str, *, model: str) -> BaseChatModel:
def build_model(
api_key: str,
*,
model: str,
max_tokens: int | None = None,
extra: Mapping[str, object] | None = None,
) -> BaseChatModel:
"""A ChatOpenAI bound to the active environment's LLM Gateway.

``use_responses_api=False`` keeps it on the chat-completions endpoint the gateway
implements (the same one `aai_cli.core.llm` uses), rather than the OpenAI
Responses API that langchain would otherwise prefer for ``openai:`` models. The
subclass also flattens content-parts arrays the gateway rejects (see
:func:`_flatten_content`).

``max_tokens`` caps the per-reply length (the live voice agent passes a small cap to
keep spoken replies short and fast); ``extra`` passes any additional gateway request
fields through as ``extra_body`` (so they reach the request body verbatim, like
`aai_cli.core.llm`'s ``extra``). Both default to off so the coding agent's call is
unchanged.
"""
from langchain_openai import ChatOpenAI
from pydantic import SecretStr
Expand All @@ -64,4 +77,6 @@ def _get_request_payload(
base_url=environments.active().llm_gateway_base,
api_key=SecretStr(api_key),
use_responses_api=False,
max_tokens=max_tokens,
extra_body=dict(extra) if extra else None,
)
4 changes: 3 additions & 1 deletion aai_cli/code_gen/agent_cascade.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,11 @@
# which is never formatted — so no brace has to be doubled.
_HEADER = """\
# Live voice cascade: Streaming STT -> LLM Gateway -> streaming TTS, wired client-side.
# This is what `assembly --sandbox agent-cascade` runs: it transcribes your speech,
# The basic cascade behind `assembly --sandbox live`: it transcribes your speech,
# sends each finalized turn to the LLM Gateway, and speaks the reply through streaming
# TTS — the same three primitives the agent-cascade init template wires server-side.
# (The `live` command adds a tool-using agent on the LLM leg; this snippet is the
# plain single-completion version to build from.)
# Requires audio + websockets: pip install sounddevice websockets openai
# Tip: use headphones — the mic stays open while the agent speaks, so on speakers it
# would hear itself and loop.
Expand Down
2 changes: 1 addition & 1 deletion aai_cli/commands/agent/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -84,7 +84,7 @@ def agent(
help="Print the equivalent Python SDK code and exit (does not start a session)",
),
) -> None:
"""Hold a live two-way voice conversation with a voice agent
"""Hold a live two-way voice conversation with the Voice Agent API

Use headphones: the mic stays open while the agent speaks, so on
speakers it would hear itself and loop. Pass an audio file/URL (or
Expand Down
34 changes: 19 additions & 15 deletions aai_cli/commands/agent_cascade/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@
SPEC = command_registry.CommandModuleSpec(
panel=help_panels.TRANSCRIPTION,
order=45, # pragma: no mutate -- sparse rank; a +-1 shift is order-equivalent
commands=("agent-cascade",),
commands=("live",),
)


Expand All @@ -43,28 +43,28 @@ def _emit_voice_list(_state: AppState, json_mode: bool) -> None:


@app.command(
name="agent-cascade",
name="live",
rich_help_panel=help_panels.TRANSCRIPTION,
epilog=examples_epilog(
[
("Start a live cascade conversation", "assembly --sandbox agent-cascade"),
("Start a live voice conversation", "assembly --sandbox live"),
(
"Pick a voice and opening line",
'assembly --sandbox agent-cascade --voice michael --greeting "Hi there"',
'assembly --sandbox live --voice michael --greeting "Hi there"',
),
(
"Give the agent a persona",
'assembly --sandbox agent-cascade --system-prompt "You are a terse pirate."',
'assembly --sandbox live --system-prompt "You are a terse pirate."',
),
("See available voices", "assembly --sandbox agent-cascade --list-voices"),
("See available voices", "assembly --sandbox live --list-voices"),
(
"Print equivalent Python instead of running",
"assembly --sandbox agent-cascade --show-code",
"assembly --sandbox live --show-code",
),
]
),
)
def agent_cascade(
def live(
ctx: typer.Context,
source: str | None = typer.Argument(
None, help="Audio file path or URL to speak to the agent. Omit to use the microphone."
Expand Down Expand Up @@ -169,14 +169,15 @@ def agent_cascade(
help="Print the equivalent Python SDK code and exit (does not start a session)",
),
) -> None:
"""\\[sandbox] Hold a live voice conversation through a self-wired cascade
"""\\[sandbox] Talk live to a tool-using voice agent

Like 'assembly agent', but instead of AssemblyAI's Voice Agent endpoint this
wires the three primitives together itself — Streaming STT, the LLM Gateway,
and streaming TTS — exactly like the 'agent-cascade' init template does
server-side. Because it uses streaming TTS it only runs in the sandbox: run
it as 'assembly --sandbox agent-cascade' (--sandbox goes before the
subcommand).
A real-time spoken conversation, wired client-side from three primitives —
Streaming STT, a deepagents brain on the LLM Gateway, and streaming TTS. Unlike
'assembly agent' (the Voice Agent API), the brain here is an agent that can use
tools mid-conversation — web search, URL fetch, and the AssemblyAI docs — so it
answers like a live multimodal assistant. Because it uses streaming TTS it only
runs in the sandbox: run it as 'assembly --sandbox live' (--sandbox goes before
the subcommand).

Use headphones: the mic stays open while the agent speaks, so on speakers it
would hear itself and loop. Pass an audio file/URL (or --sample) to speak a
Expand All @@ -185,6 +186,9 @@ def agent_cascade(

This only runs a conversation in the terminal — it writes no code. To build
an agent-cascade app, run 'assembly init agent-cascade' instead.

Web search needs a TAVILY_API_KEY in the environment; without it the agent
keeps its URL-fetch and docs tools.
"""

if list_voices:
Expand Down
4 changes: 2 additions & 2 deletions aai_cli/commands/agent_cascade/_exec.py
Original file line number Diff line number Diff line change
Expand Up @@ -169,9 +169,9 @@ def _print_show_code(opts: AgentCascadeOptions, system_prompt_text: str) -> None
def run_agent_cascade(opts: AgentCascadeOptions, state: AppState, *, json_mode: bool) -> None:
"""Execute one `assembly agent-cascade` cascade from already-parsed flags."""
text_mode, json_mode = resolve_output_modes(opts.output_field, json_mode=json_mode)
validate_voice(opts.voice, voices.VOICE_NAMES, command="agent-cascade")
validate_voice(opts.voice, voices.VOICE_NAMES, command="live")
# Streaming TTS has no production host, so the whole cascade is sandbox-only.
tts_session.require_available("agent-cascade")
tts_session.require_available("live")
system_prompt_text = _resolve_system_prompt(opts.system_prompt, opts.system_prompt_file)

if opts.show_code:
Expand Down
Loading
Loading