The open reference pipeline for AI agents that think before they act.
Intent → Ambiguity → Clarifier → Planner → Executor. Five typed stages. One streaming pipeline. ~4k lines you can read in an afternoon.
Each stage has a typed input and a typed output. The Pydantic schema between any two stages is your test surface — and your debug trail.
Clarifier searches the web first, auto-fills what it can, and asks you only what it genuinely couldn't find. One question. Not seven.
No Exa? Skips web. No Redis? In-memory. No RAG? No problem. Missing keys are features, not errors.
```bash
git clone https://github.com/OpenGraph-AI/OpenAgent.git
cd OpenAgent
pip install -r requirements.txt
cp .env.example .env  # set LLM_API_KEY at minimum
python run.py
```
Open http://<your-domain>:8000/static/index.html and type a fuzzy request. Watch each phase stream into the UI in real time: intent extraction, ambiguity flags, clarifying questions, the plan, and finally the executor producing the answer step-by-step.
Minimum config — one variable:
| Variable | Required | Purpose |
|---|---|---|
| `LLM_API_KEY` | ✅ | Any OpenAI-compatible provider |
| `LLM_BASE_URL` | — | Defaults to https://api.openai.com/v1 |
| `LLM_MODEL` | — | Defaults to gpt-4o |
Optional providers (Exa for web search, Upstash for session persistence, PageIndex for RAG) are listed in .env.example. The pipeline degrades gracefully when any of them are missing — that's on purpose.
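A minimal sketch of how that configuration could be read, assuming nothing about the repo's actual loader. Only `LLM_API_KEY`, `LLM_BASE_URL`, `LLM_MODEL`, and their defaults come from the table above; `Settings`, `load_settings`, and the `EXA_API_KEY` variable name are hypothetical (check `.env.example` for the real optional keys).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llm_api_key: str
    llm_base_url: str
    llm_model: str
    web_search_enabled: bool  # hypothetical flag derived from an optional key


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("LLM_API_KEY", "").strip()
    if not api_key:
        # The one required variable: fail early with a clear message.
        raise RuntimeError("LLM_API_KEY is required")
    return Settings(
        llm_api_key=api_key,
        llm_base_url=env.get("LLM_BASE_URL") or "https://api.openai.com/v1",
        llm_model=env.get("LLM_MODEL") or "gpt-4o",
        # "EXA_API_KEY" is an assumed name. A missing optional key turns the
        # feature off instead of raising.
        web_search_enabled=bool(env.get("EXA_API_KEY")),
    )
```

Passing the environment in as a mapping keeps the function easy to test: `load_settings({"LLM_API_KEY": "k"})` needs no real environment variables.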
Prefer Docker?
```bash
docker compose up
```
Each stage is a specialist. The output of one is the typed input of the next. If any stage misbehaves, you can swap it, mock it, or inspect it without touching the others.
Read this like a cookbook. Every stage answers a question you'll eventually have to answer yourself:
- How do I turn a blurry request into something structured?
- How do I know when I don't know enough?
- How do I ask without annoying the user?
- How do I turn intent into an executable plan?
- How do I execute without losing the thread?
Five questions. Five agents. That's the whole book.
| # | Stage | In | Out | Mission |
|---|---|---|---|---|
| 01 | 🧠 Intent | `str` | `IntentSchema` | Turn fuzz into a typed goal |
| 02 | ❓ Ambiguity | `IntentSchema` | `AmbiguityReport` | Flag the known unknowns |
| 03 | 🕸️ Clarifier | `AmbiguityReport` | `ClarifiedIntent` | Auto-resolve, ask only the rest |
| 04 | 🗺️ Planner | `ClarifiedIntent` | `ExecutionPlan` | A DAG of verifiable steps |
| 05 | ⚡ Executor | `ExecutionPlan` | `ExecutionResult` | Run, stream, trace to goal |
The problem. Humans don't type goals. They type fragments, moods, half-sentences. "can you make this better" is not a specification — it's a vibe. Executing on a vibe gives you a confident-sounding wrong answer.
The job. Turn raw text into a typed object the rest of the pipeline can reason about: goal, context, constraints, output format, success criteria, and alternative interpretations. That last field is the one most tutorials skip and the one that saves you.
The mental model. Think of Intent as the function signature. Until you have it, you don't have a problem — you have a feeling.
The pattern. One LLM call, a strict schema, and a prompt that forbids prose. Parse, validate, store.
```python
from pydantic import BaseModel


class IntentSchema(BaseModel):
    goal: str
    context: str
    constraints: list[str]
    expected_output_format: str
    success_criteria: list[str]
    alternative_interpretations: list[str]  # ← your future self will thank you
    confidence: float  # ← triggers the next stage
```
Pitfalls.
- Skipping `alternative_interpretations`: you lose the ability to catch ambiguity downstream.
- Letting the model hallucinate fields by not pinning the schema.
- Treating low-confidence intents as high-confidence ones. Confidence is a signal — use it.
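The "parse, validate, store" step and the schema-pinning pitfall can be sketched with the standard library alone. This is an illustrative stand-in for the Pydantic validation the repo relies on, not its implementation; `Intent` and `parse_intent` are hypothetical names.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Intent:
    goal: str
    context: str
    constraints: list
    expected_output_format: str
    success_criteria: list
    alternative_interpretations: list
    confidence: float


def parse_intent(raw: str) -> Intent:
    """Parse a model reply, rejecting prose, missing fields, and extra fields."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on prose
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name for f in fields(Intent)}
    missing, extra = expected - set(data), set(data) - expected
    if missing or extra:
        # Pinning the schema: invented fields are an error, not a bonus.
        raise ValueError(f"schema mismatch: missing={sorted(missing)} extra={sorted(extra)}")
    confidence = data["confidence"]
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a number in [0, 1]")
    return Intent(**data)
```

Every failure surfaces as a `ValueError` at the stage boundary, so the caller can retry the LLM call or stop before anything downstream sees a malformed intent.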
Code: backend/agents/intent_agent.py
The problem. Even a cleanly extracted intent can be under-specified. "Write a blog post about our launch" is structured but still missing: audience, length, tone, deadline, distribution channel. An agent that steamrolls past this will produce a polished artifact no one asked for.
The job. Audit the intent along fixed dimensions — scope, audience, depth, format, deadline, domain — and flag each with a severity: low, medium, high. If any flag is medium-or-higher, set needs_clarification = true.
The mental model. This is your agent's epistemic humility layer. Its output is a list of known unknowns. Don't let the agent "figure it out" — let it admit what it can't.
The pattern.
```python
class AmbiguityReport(BaseModel):
    flags: list[AmbiguityFlag]  # dimension, level, impact
    needs_clarification: bool
    reasoning: str
```
The report is a decision gate. The pipeline branches on needs_clarification, not on a gut feel.
Pitfalls.
- Letting the ambiguity agent also write the clarification questions. Separate the "what's missing" from the "how do I ask" — otherwise you optimise both poorly.
- Hard-coding severity thresholds in the prompt instead of leaving them explicit as fields you can tune later.
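A small sketch of the gate with the threshold kept as an explicit, tunable parameter. The `Severity` enum, the `Flag` dataclass, and the `needs_clarification` helper are hypothetical stand-ins, not the repo's `AmbiguityFlag` model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Flag:
    dimension: str  # e.g. "audience", "deadline"
    level: Severity
    impact: str


def needs_clarification(flags, threshold=Severity.MEDIUM):
    """True if any flag is at or above the threshold."""
    return any(flag.level >= threshold for flag in flags)


flags = [
    Flag("audience", Severity.MEDIUM, "tone depends on the reader"),
    Flag("format", Severity.LOW, "markdown is a safe default"),
]
print(needs_clarification(flags))                           # True
print(needs_clarification(flags, threshold=Severity.HIGH))  # False
```

Because the threshold lives in Python and not in the prompt, tightening or loosening the gate is a one-argument change that does not require re-testing the prompt.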
Code: backend/agents/ambiguity_agent.py
The problem. The naive fix for ambiguity is "just ask the user." Do that for every request and your agent becomes a questionnaire. The good fix: answer what the web can answer, ask the user for what only they know.
The job. For each ambiguity flag:
- Derive 2–3 targeted web queries from the flag's dimension.
- Search (Exa, in this repo — pluggable).
- Ask the LLM: "can this question be confidently answered from these results alone?" with a confidence threshold (default 0.7).
- If yes → auto-resolve, mark the source.
- If no → surface to the user as a minimal, decision-oriented question.
The mental model. This is a cost-triage step. User attention is the most expensive resource your agent has. Spend it only on personal or organizational context — things that are genuinely unknowable without the user.
The pattern.
```python
questions, search_results = await clarifier.generate_questions(ambiguity_report)
questions = await clarifier.auto_resolve_questions(questions, search_results, ambiguity_report)
unresolved = [q for q in questions.questions if not q.auto_resolved]
if unresolved:
    await send_to_user(unresolved)  # pause the pipeline
    answers = await wait_for_user_response()
else:
    answers = clarifier.build_auto_answers(questions)  # sail through
clarified = await clarifier.process_answers(ambiguity_report, answers)
```
Pitfalls.
- Asking the user every question you could have searched for.
- Setting the auto-resolve confidence threshold too low (the model will confabulate sources).
- Asking four questions at once. Cap it at three. Users answer three; they abandon seven.
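The triage rule can be sketched as a pure function. The 0.7 threshold and the cap of three come from the text above; `Candidate`, `triage`, and the idea of returning the overflow as a separate "deferred" list are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    question: str
    answer: Optional[str]  # best answer found by search, if any
    confidence: float      # how well the search results support that answer


def triage(candidates, threshold=0.7, max_questions=3):
    """Split candidates into auto-resolved answers, questions to ask, and overflow."""
    resolved, ask = {}, []
    for c in candidates:
        if c.answer is not None and c.confidence >= threshold:
            resolved[c.question] = c.answer  # the web answered it
        else:
            ask.append(c.question)           # only the user knows
    # Never show more than max_questions at once; the rest is deferred.
    return resolved, ask[:max_questions], ask[max_questions:]
```

Keeping the overflow visible, instead of silently dropping it, leaves the orchestrator free to fall back to defaults or ask again later.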
Code: backend/agents/clarification_agent.py
The problem. Dropping a clarified goal into a single "do the thing" prompt gives you a brittle monolith. The model can't back up. You can't resume. You can't verify anything until the whole thing finishes — and by then, you're five paragraphs deep into the wrong answer.
The job. Turn a clarified intent into a list of numbered, dependency-aware, independently verifiable steps. Each step declares its inputs, expected output, dependencies (by step number), and a validation criterion.
The mental model. A plan is a contract. Each step is a row; the whole plan is auditable before anything runs.
The pattern.
```python
from pydantic import BaseModel


class PlanStep(BaseModel):
    step_number: int
    description: str
    inputs: list[str]
    expected_output: str
    dependencies: list[int]  # topological execution
    validation: str  # how to know it succeeded
```
Dependencies let the executor topologically sort steps and, later, parallelize independent branches. Validation turns "done" into a checkable predicate instead of a feeling.
Pitfalls.
- Vague steps ("Analyze the data"). If you can't say how you'd verify it, the model can't execute it.
- Flat plans with no dependency graph. You'll miss parallelism opportunities and silently re-execute work.
- Coupling planning and execution in one prompt. You lose the ability to inspect, edit, or cache the plan.
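To make the dependency graph concrete, here is a sketch that validates a plan and groups steps into batches that could run in parallel. It uses the standard library's `graphlib` (Python 3.9+); `Step` and `execution_batches` are hypothetical names, and the repo's own sorting may differ.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    step_number: int
    dependencies: list = field(default_factory=list)


def execution_batches(steps):
    """Return lists of step numbers; each batch depends only on earlier batches."""
    known = {s.step_number for s in steps}
    for s in steps:
        unknown = set(s.dependencies) - known
        if unknown:
            raise ValueError(f"step {s.step_number} depends on unknown steps {sorted(unknown)}")
    sorter = TopologicalSorter({s.step_number: set(s.dependencies) for s in steps})
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


plan = [Step(1), Step(2, [1]), Step(3, [1]), Step(4, [2, 3])]
print(execution_batches(plan))  # [[1], [2, 3], [4]]
```

Running this check before execution is what makes the plan auditable: a dangling dependency or a cycle is rejected before a single LLM call is spent on it.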
Code: backend/agents/planning_agent.py
The problem. An executor that reaches for tools mid-generation is slow and erratic. The model decides while generating what to search for, then context-switches. Quality drops.
The job. Before executing any step, read the whole plan and, for each step, decide what it needs: knowledge-base lookups, web searches, outputs from dependency steps. Fan out all retrievals in parallel. Attach the results to each step as a StepContext.
The mental model. Separate gathering from reasoning. Gathering is embarrassingly parallel. Reasoning is serial. Don't interleave them.
The pattern.
```python
resource_plan = await context_agent.gather(plan, clarified_intent, session_state)
# resource_plan.step_contexts[i] contains everything step i will need
```
Code: backend/agents/context_agent.py
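The "gather first, reason later" idea can be sketched with `asyncio.gather`. The retrievers below are fake stand-ins that sleep to simulate I/O; `kb_lookup`, `web_search`, `gather_step_context`, and `gather_all` are hypothetical names, not the repo's API.

```python
import asyncio


async def kb_lookup(query: str) -> str:
    await asyncio.sleep(0.01)  # simulated knowledge-base latency
    return f"kb:{query}"


async def web_search(query: str) -> str:
    await asyncio.sleep(0.01)  # simulated web-search latency
    return f"web:{query}"


async def gather_step_context(step_number: int, query: str) -> dict:
    # Both retrievals for one step run concurrently.
    kb, web = await asyncio.gather(kb_lookup(query), web_search(query))
    return {"step": step_number, "kb": kb, "web": web}


async def gather_all(queries: dict) -> dict:
    # All steps fan out concurrently too; results are keyed by step number.
    contexts = await asyncio.gather(
        *(gather_step_context(n, q) for n, q in queries.items())
    )
    return {c["step"]: c for c in contexts}


contexts = asyncio.run(gather_all({1: "pricing", 2: "competitors"}))
print(contexts[2]["web"])  # web:competitors
```

With four retrievals of roughly equal latency, the total wait is close to one retrieval instead of four, because nothing in the gathering phase depends on anything else.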
The problem. "Execute the plan" is another vibe. A real executor has to: run steps in dependency order, pass prior outputs into dependents, keep streaming chunks to the UI, handle a step that fails without corrupting the rest, and at the end prove the deliverable actually answers the original goal.
The job. Four things:
- Topologically sort steps.
- For each step, build a message with: the step description, its gathered context, and the outputs of its dependencies.
- Stream chunks back as the model generates, so the UI isn't frozen.
- At the end, run a final-assembly pass with explicit checks: completeness, clarity, relevance, correctness, and a `trace_to_goal` that maps the deliverable back to the original intent.
The mental model. Execution is a loop with memory. Each iteration has a narrow, well-scoped prompt. The outer loop holds the state.
The pattern.
```python
step_results = {}  # step_number -> result, kept for the debug trail
for step in topological_sort(plan.steps):
    ctx = resource_plan.step_contexts[step.step_number]
    deps = {d: step_results[d] for d in step.dependencies}
    result = await executor.process_step(step, ctx, deps, on_stream=send_chunk)
    step_results[step.step_number] = result
final = await executor.assemble_final(plan, list(step_results.values()))
# final includes: output, completeness_check, clarity_check,
# relevance_check, correctness_check, trace_to_goal
```
Pitfalls.
- One giant "execute everything" prompt. You lose per-step streaming, per-step failure recovery, and the ability to retry just the broken step.
- Skipping `trace_to_goal`. The final check is what catches a technically-correct, goal-irrelevant output.
- Throwing away intermediate step results once you have the final. Keep them: that's your debug trail.
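Per-step failure recovery could look like the synchronous sketch below. It is an illustration under assumptions, not the repo's executor: `run_plan`, its arguments, and the retry count are hypothetical, and step-level retries are still listed on the roadmap.

```python
def run_plan(order, deps, run_step, max_attempts=2):
    """Run steps in the given order; retry failures and skip blocked dependents.

    order:    step numbers, already topologically sorted
    deps:     mapping of step number -> list of dependency step numbers
    run_step: callable(step_number, dependency_outputs) -> output
    """
    results, failed = {}, {}
    for n in order:
        blocked = [d for d in deps.get(n, []) if d not in results]
        if blocked:
            # A dependency never produced output, so do not run on bad input.
            failed[n] = f"skipped: dependency {blocked[0]} did not complete"
            continue
        inputs = {d: results[d] for d in deps.get(n, [])}
        last_error = None
        for _ in range(max_attempts):
            try:
                results[n] = run_step(n, inputs)
                break
            except Exception as exc:  # retry only this step
                last_error = exc
        else:
            failed[n] = f"failed after {max_attempts} attempts: {last_error}"
    return results, failed
```

Independent branches still finish when one step fails, and `results` keeps every intermediate output, so a later retry can restart from the broken step instead of from the beginning.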
Code: backend/agents/execution_agent.py
Small patterns that keep recurring once you start building real agents.
Stream by default, not as an afterthought. Every agent here accepts an on_stream callback. If your LLM client doesn't support streaming, wrap it so it does — even if the wrapper just flushes at the end. Your UI code should never have two paths.
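A sketch of the "wrapper that just flushes at the end", assuming a blocking, non-streaming client function. `with_streaming` and the `complete` callable are hypothetical; the repo's agents may use an async signature.

```python
from typing import Callable, Optional


def with_streaming(complete: Callable[[str], str]):
    """Give a non-streaming completion function an on_stream interface."""

    def streamed(prompt: str, on_stream: Optional[Callable[[str], None]] = None) -> str:
        text = complete(prompt)
        if on_stream is not None:
            on_stream(text)  # the whole reply arrives as one final chunk
        return text

    return streamed


chunks = []
fake_llm = with_streaming(lambda prompt: prompt.upper())  # stand-in for a real client
reply = fake_llm("hello", on_stream=chunks.append)
print(reply, chunks)  # HELLO ['HELLO']
```

The UI only ever sees the `on_stream` path, so swapping in a client that streams token by token later changes nothing on the front end.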
Pause the pipeline like a coroutine, not a state machine. Clarification pauses between phases and resumes when the user answers. Model this with a session object that knows its current phase, not a pile of booleans. See SessionState and resume_after_clarification in backend/orchestrator/pipeline.py.
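One way to model "a session that knows its current phase" is an enum plus two transitions. This sketch is not the repo's `SessionState`; `Phase`, `Session`, `pause_for_user`, and `resume` are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    INTENT = "intent"
    AWAITING_USER = "awaiting_user"
    PLANNING = "planning"
    DONE = "done"


@dataclass
class Session:
    phase: Phase = Phase.INTENT
    pending_questions: list = field(default_factory=list)
    answers: dict = field(default_factory=dict)

    def pause_for_user(self, questions):
        """Record what was asked and stop until the user replies."""
        self.pending_questions = list(questions)
        self.phase = Phase.AWAITING_USER

    def resume(self, answers):
        """Accept the user's answers and move on to planning."""
        if self.phase is not Phase.AWAITING_USER:
            raise RuntimeError(f"cannot resume from phase {self.phase.value}")
        self.answers.update(answers)
        self.pending_questions = []
        self.phase = Phase.PLANNING
```

A single `phase` field makes illegal states unrepresentable in a way a pile of booleans cannot: the session is never both "waiting" and "planning".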
Cache the intent, not the final output. For a fixed model and prompt, the intent depends only on the user's text (sampling noise aside), so it is safe to cache. The final output depends on the full session, including clarifications, so caching it will bite you.
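A minimal in-memory sketch of intent caching, keyed on everything the intent depends on. `IntentCache`, the `prompt_version` argument, and the `extract` callable are assumptions for illustration.

```python
import hashlib


class IntentCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def key(user_text: str, prompt_version: str, model: str) -> str:
        # Any change to the model, the prompt, or the text yields a new key.
        payload = "\x1f".join([model, prompt_version, user_text.strip()])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, user_text, prompt_version, model, extract):
        k = self.key(user_text, prompt_version, model)
        if k not in self._store:
            self._store[k] = extract(user_text)  # the only place the LLM is called
        return self._store[k]
```

Including the prompt version in the key means editing the intent prompt invalidates old entries automatically, with no manual cache flush.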
Use typed schemas at every boundary. Pydantic models between agents aren't ceremony — they're your test surface. A bad output gets caught at parse time, not six steps later when something dereferences a missing field.
Run all retrievals for a step in parallel. The context gatherer fans out KB, web, and prior-phase extraction with asyncio.gather. Sequential retrievals leave 3–5× latency on the table.
Keep the LLM out of control flow. The LLM decides content. Python decides flow — which phase runs, when to pause, when to retry, when to fall back. If your agent has an if tool_name == "ask_user": … branch inside the prompt, you've inverted it.
| | OpenAgent | LangGraph | CrewAI | AutoGen |
|---|---|---|---|---|
| Mental model | Typed pipeline | Graph of nodes | Role-playing crew | Multi-agent chat |
| Typed contracts between stages | ✅ Pydantic | | | |
| Auto-resolving clarifier | ✅ Built-in | ❌ | ❌ | ❌ |
| Model lock-in | None | None | None | None |
| Framework weight | ~4k LOC, readable | Heavy | Heavy | Heavy |
| "Pause for user" as first-class | ✅ | ❌ | | |
| Reads like a cookbook | ✅ By design | ❌ | ❌ | |
When to pick OpenAgent. You want to understand every moving part, control each prompt, and own your agent's reasoning end-to-end — not inherit someone else's abstraction.
When to pick a framework instead. You want to ship fast without thinking about architecture, and the framework's defaults happen to match your domain.
| | |
|---|---|
| 🛠️ First-agent builders | Five `while True: llm()` prototypes in a drawer. None ship. Start here. |
| 🏗️ Framework evaluators | You've read the LangGraph docs twice and still don't trust the abstractions. This is ~4k lines. Read it in an afternoon. |
| 🧪 Production debuggers | "It's doing something weird in prod." OpenAgent tells you exactly which stage lied, with the transcript. |
If you're here to learn, open files in this order:
1. `backend/models/schemas.py`: the contracts between phases. Read this first; everything else is transformations.
2. `backend/agents/intent_agent.py`: the simplest agent. A good template for your own.
3. `backend/agents/clarification_agent.py`: the most interesting one. Auto-resolve via web search is the trick worth stealing.
4. `backend/orchestrator/pipeline.py`: how the phases are wired, paused, and resumed.
5. `backend/agents/execution_agent.py`: step-by-step execution with per-step context injection.
- Typed contracts between all five stages
- Anthropic-native tool use (beyond OpenAI-compatible)
- Step-level retries with plan-edit capability
- First-class observability (OpenTelemetry spans per stage)
- Browser extension: capture user intent from any form
Roadmap subject to change. Open an issue if one of these matters to you and we'll bump it.
Issues, patches, and hard questions are all welcome. See CONTRIBUTING.md for the short version — fork, keep the change small, smoke-test with python run.py, and open the PR with a one-paragraph why. Every stage is intentionally small enough that a PR can meaningfully change one thing at a time.
OpenGraph.tech builds the infrastructure for agents that reason openly, not opaquely. This repo is our reference pipeline — the thing we run, the thing we ship against, and the thing we learn from. If you're building agents in production and want to compare notes, we'd like to hear from you.
Follow us on LinkedIn — that's where we post build notes and what shipped this week.
This repo is free. The cookbook is free. The walkthrough is on YouTube, free. The only thing we ask back is a star — it's the one signal that tells us to write more of these, louder.
One click. No account prompt. It genuinely helps.
Made with intent, by OpenGraph.tech.