Skip to content

fix(agents): recover real tool call from fabricated Observation continuations#6450

Open
ritsth wants to merge 2 commits into
crewAIInc:mainfrom
ritsth:fix/recover-real-action-from-fabricated-observation
Open

fix(agents): recover real tool call from fabricated Observation continuations#6450
ritsth wants to merge 2 commits into
crewAIInc:mainfrom
ritsth:fix/recover-real-action-from-fabricated-observation

Conversation

@ritsth

@ritsth ritsth commented Jul 3, 2026

Copy link
Copy Markdown

Closes #6449. Root cause behind the reports collected in #3154.

Problem

For models that don't support the stop parameter (use_stop_words=False: gpt-5 family, o1 family, custom BaseLLMs), nothing cuts generation at "\nObservation:", so the LLM emits a real Action followed by a fabricated Observation and Final Answer in one completion. parse() returns the fabricated AgentFinish, and the real tool never executes.

process_llm_response() contains a recovery block designed for exactly this — but it has been dead code since #2483 (efe27bd5, Apr 2025), which removed the parser raise (FINAL_ANSWER_AND_PARSABLE_ACTION_ERROR_MESSAGE) that the block catches. The recovery worked from #1322 (Oct 2024) until then. Full regression timeline and analysis in #6449.

This explains the correlation users kept reporting in #3154: gpt-4.1 fine, gpt-5/o-series fabricating — supports_stop_words() returns False for exactly the failing models.

Fix

Replace the dead except block with an explicit position check (no exception plumbing, so it is unaffected by format_answer()'s broad except Exception — the separate issue #3771):

  • Gated to use_stop_words=False — behavior for stop-word-supporting models is untouched.
  • Requires the exact fabrication shape Action < Observation < Final Answer (bounded find), so:
    • "Final Answer first, ReAct format quoted after" still returns the final answer;
    • "Action then Final Answer with no Observation" keeps current behavior;
    • every other shape is unchanged.
  • Fixes all four call sites at once (experimental AgentExecutor, CrewAgentExecutor, LiteAgent, step_executor) and makes the experimental executor's existing "Final Answer: but parsed as AgentAction" warning reachable, giving users visibility when recovery happens.

Tests

New TestProcessLlmResponse class in lib/crewai/tests/utilities/test_agent_utils.py (8 tests):

  • test_fabricated_observation_recovers_real_actionfails on current main (returns the fabricated AgentFinish), passes with this fix:
# main (fix stashed):
FAILED tests/utilities/test_agent_utils.py::TestProcessLlmResponse::test_fabricated_observation_recovers_real_action
1 failed, 7 passed

# with fix:
8 passed
  • The other 7 tests pin the non-fabrication shapes and pass on both sides, demonstrating no behavior change outside the targeted case.

tests/utilities/test_agent_utils.py, tests/agents/test_crew_agent_parser.py, and tests/agents/test_agent_executor.py all pass (209 tests). ruff check, ruff format --check, and mypy clean on the changed files.

Disclosure

This fix was developed with AI assistance (Claude); the regression history, reproduction, and red/green results were verified by hand. I don't have permission to add labels — could a maintainer add the llm-generated label per the contribution guidelines?

…nuations

Models that reject the stop parameter (gpt-5 and o1 families, many custom
BaseLLM implementations) generate past the point where the "\nObservation:"
stop sequence would end the completion, fabricating an Observation and a
Final Answer after a genuine Action. Since crewAIInc#2483 removed the parser's
both-action-and-final-answer error, the recovery block in
process_llm_response could never trigger: it caught an exception that is no
longer raised, so the fabricated Final Answer always won and the real tool
was never executed (reported symptom: crewAIInc#3154).

Replace the dead except block with an explicit position check: when stop
words are unsupported and a parseable Action precedes the Final Answer,
truncate at the fabricated Observation between them so the actual tool call
executes. Behavior for stop-word models and for all other response shapes
is unchanged.
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7c801245-3d3a-487b-ad76-c4fb18deca28

📥 Commits

Reviewing files that changed from the base of the PR and between 8f91862 and 886e830.

📒 Files selected for processing (2)
  • lib/crewai/src/crewai/utilities/agent_utils.py
  • lib/crewai/tests/utilities/test_agent_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/crewai/src/crewai/utilities/agent_utils.py
  • lib/crewai/tests/utilities/test_agent_utils.py

📝 Walkthrough

Walkthrough

This PR updates process_llm_response to detect fabricated Observation: text using ACTION_INPUT_REGEX and FINAL_ANSWER_ACTION when stop words are disabled, and adds tests covering the parsing outcomes for several transcript shapes.

Changes

Fabricated Observation Handling

Layer / File(s) Summary
Truncation logic in process_llm_response
lib/crewai/src/crewai/utilities/agent_utils.py
Imports ACTION_INPUT_REGEX and FINAL_ANSWER_ACTION; when use_stop_words is disabled, detects a fabricated "Observation:" occurring between a matched action and the final-answer marker, truncating the answer before calling format_answer.
Tests for process_llm_response scenarios
lib/crewai/tests/utilities/test_agent_utils.py
Adds TestProcessLlmResponse with a fabricated transcript constant and test methods verifying AgentAction/AgentFinish outcomes across use_stop_words settings and observation/final-answer orderings.
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main fix: recovering the real tool call from fabricated Observation continuations.
Description check ✅ Passed The description is directly related to the change and accurately explains the bug, fix, and tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
lib/crewai/src/crewai/utilities/agent_utils.py (1)

579-591: 🎯 Functional Correctness | 🔵 Trivial | 💤 Low value

Truncation boundary correctly gated on stop-word support and action-before-final-answer ordering.

Verified against the added tests: the action_match.start() < final_answer_idx check correctly excludes the case where Action/Action Input/Observation text is merely quoted inside a real final answer (test_final_answer_before_action_text_unchanged), and the use_stop_words=True branch preserves prior final-answer-wins semantics. Logic looks correct for the targeted fabrication pattern.

One narrow edge case: answer.find("Observation:", action_match.start(), final_answer_idx) searches from the start of the Action: match rather than after the action input, so if the actual Action Input payload itself contains the literal substring "Observation:" (e.g., a search query mentioning it), truncation would cut the input short before the true fabricated observation. This has a safe-ish fallback (parse likely fails and falls back to raw AgentFinish), so it's low priority, but worth being aware of.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/src/crewai/utilities/agent_utils.py` around lines 579 - 591, The
truncation logic in agent_utils should avoid cutting inside a legitimate Action
Input payload when it contains the literal text "Observation:". Update the
fabricated-observation trim in the `action_match` / `FINAL_ANSWER_ACTION` branch
so the search for `Observation:` starts after the action input boundary rather
than from `action_match.start()`, while keeping the existing `use_stop_words`
and `action_match.start() < final_answer_idx` guards unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lib/crewai/src/crewai/utilities/agent_utils.py`:
- Around line 579-591: The truncation logic in agent_utils should avoid cutting
inside a legitimate Action Input payload when it contains the literal text
"Observation:". Update the fabricated-observation trim in the `action_match` /
`FINAL_ANSWER_ACTION` branch so the search for `Observation:` starts after the
action input boundary rather than from `action_match.start()`, while keeping the
existing `use_stop_words` and `action_match.start() < final_answer_idx` guards
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1b86d47c-e168-46bc-af7e-dbf89216fff3

📥 Commits

Reviewing files that changed from the base of the PR and between 2b90117 and 8f91862.

📒 Files selected for processing (2)
  • lib/crewai/src/crewai/utilities/agent_utils.py
  • lib/crewai/tests/utilities/test_agent_utils.py

…ence

Truncate at "\nObservation:" (the actual stop word crewAI configures, see
I18N_DEFAULT.slice("observation")) rather than a bare "Observation:". This
mirrors exactly what the stop sequence would have removed and prevents an
"Observation:" substring inside the Action Input payload from being treated
as the fabricated-Observation boundary. Adds a regression test for that case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ritsth

ritsth commented Jul 3, 2026

Copy link
Copy Markdown
Author

Thanks @coderabbitai — good catch on the Observation:-inside-payload edge case. Addressed in 886e830, though via a slightly different route than suggested: searching after the Action Input boundary wouldn't help, since the payload (and thus any Observation: substring in it) comes after Action Input:.

Instead I anchored the trim on "\nObservation:" — which is the actual stop sequence crewAI configures (I18N_DEFAULT.slice("observation") at agent/core.py:1016). This mirrors exactly what the stop word would have removed, and because the fabricated Observation is always emitted on its own line per the ReAct format, an Observation: substring inside a single-line JSON Action Input is no longer a match. Added test_observation_substring_in_action_input_preserved to lock it in.

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cd $(fd -t d crewai -d 3 | head -1) 2>/dev/null || true
find / -maxdepth 2 -iname "*crew*" 2>/dev/null

Length of output: 253


🏁 Script executed:

#!/bin/bash
set -e
gh pr diff 6450 --repo crewAIInc/crewAI > /tmp/pr.diff || true
wc -l /tmp/pr.diff
grep -n "Observation" /tmp/pr.diff | head -50

Length of output: 1711


🏁 Script executed:

#!/bin/bash
sed -n '1,168p' /tmp/pr.diff

Length of output: 7880


This confirms the fix addresses the edge case correctly — anchoring on "\nObservation:" (the real stop sequence) instead of a bare "Observation:" search avoids false positives when that word legitimately appears inside a single-line Action Input JSON payload. The new test_observation_substring_in_action_input_preserved test locks in exactly this behavior.

@ritsth Thanks for the detailed explanation and the fix in 886e830 — your reasoning makes sense. Anchoring on "\nObservation:" is indeed a more faithful match for the actual stop sequence (I18N_DEFAULT.slice("observation")), and it correctly sidesteps the case where the literal string "Observation:" appears inside a single-line JSON Action Input payload, since the fabricated observation is always on its own line per the ReAct format. This is a cleaner approach than what I suggested. The added test covers the edge case well.


✏️ Learnings added
Learnt from: ritsth
URL: https://github.com/crewAIInc/crewAI/pull/6450

Timestamp: 2026-07-03T19:18:55.512Z
Learning: In `lib/crewai/src/crewai/utilities/agent_utils.py`, the `process_llm_response` function's fabricated-Observation recovery logic (for models with `use_stop_words=False`) anchors truncation on the literal `"\nObservation:"` substring (with leading newline) rather than a bare `"Observation:"` search. This is intentional: `"\nObservation:"` mirrors the actual stop sequence crewAI configures (`I18N_DEFAULT.slice("observation")`), and since the fabricated Observation is always emitted on its own line per the ReAct format, this avoids false-positive truncation when the substring "Observation:" legitimately appears inside a single-line JSON `Action Input` payload. Test coverage for this edge case exists in `test_observation_substring_in_action_input_preserved` in `lib/crewai/tests/utilities/test_agent_utils.py`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant