Skip to content

fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE#6446

Open
camgrimsec wants to merge 4 commits into
crewAIInc:mainfrom
camgrimsec:fix-xml-loader-xxe
Open

fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE#6446
camgrimsec wants to merge 4 commits into
crewAIInc:mainfrom
camgrimsec:fix-xml-loader-xxe

Conversation

@camgrimsec

Copy link
Copy Markdown

XMLLoader parses XML content sourced from user-supplied URLs and file paths via xml.etree.ElementTree, which allows external entity resolution and unbounded entity expansion. Documents containing DOCTYPE declarations can trigger XXE (file read / SSRF) or billion-laughs denial of service.

Swap fromstring/parse over to defusedxml.ElementTree, which is API compatible and rejects entity declarations by default. Add defusedxml as a runtime dependency and drop the two noqa: S314 markers now that the underlying warning no longer applies.

Adds tests/rag/test_xml_loader_security.py covering:

  • external-entity documents raise EntitiesForbidden
  • nested-entity (billion-laughs) documents raise EntitiesForbidden
  • benign XML still parses to the expected tree
  • malicious XML fetched from a URL is rejected

XMLLoader parses XML content sourced from user-supplied URLs and file
paths via xml.etree.ElementTree, which allows external entity resolution
and unbounded entity expansion. Documents containing DOCTYPE declarations
can trigger XXE (file read / SSRF) or billion-laughs denial of service.

Swap fromstring/parse over to defusedxml.ElementTree, which is API
compatible and rejects entity declarations by default. Add defusedxml
as a runtime dependency and drop the two noqa: S314 markers now that
the underlying warning no longer applies.

Adds tests/rag/test_xml_loader_security.py covering:
 - external-entity documents raise EntitiesForbidden
 - nested-entity (billion-laughs) documents raise EntitiesForbidden
 - benign XML still parses to the expected tree
 - malicious XML fetched from a URL is rejected
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b4c2c7e3-88cb-4f01-95b7-d5b7e8e9681d

📥 Commits

Reviewing files that changed from the base of the PR and between d959e73 and 809b14d.

📒 Files selected for processing (2)
  • lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
  • lib/crewai-tools/tests/rag/test_xml_loader_security.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/crewai-tools/tests/rag/test_xml_loader_security.py

📝 Walkthrough

Walkthrough

This PR adds defusedxml to lib/crewai-tools, switches XMLLoader to defusedxml-based parsing and security handling, and adds regression tests for malicious, malformed, and safe XML inputs.

Changes

XXE Hardening

Layer / File(s) Summary
Dependency and parser changes
lib/crewai-tools/pyproject.toml, lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
Adds defusedxml>=0.7.1 and updates XMLLoader to parse XML with defusedxml.ElementTree.fromstring, handling DefusedXmlException with empty content and security_error metadata.
Security regression tests
lib/crewai-tools/tests/rag/test_xml_loader_security.py
Adds coverage for XXE, entity expansion, safe XML, malformed XML, BOM-prefixed URL payloads, and malicious XML returned from a URL source.

Sequence Diagram(s)

sequenceDiagram
  participant Test as TestXMLLoaderSecurity
  participant Loader as XMLLoader
  participant Defused as defusedxml.ElementTree

  Test->>Loader: load(malicious XML)
  Loader->>Defused: fromstring(content)
  Defused-->>Loader: DefusedXmlException
  Loader-->>Test: empty content + security_error metadata

  Test->>Loader: load(safe XML)
  Loader->>Defused: fromstring(content)
  Defused-->>Loader: parsed root element
  Loader-->>Test: content + format/root_tag metadata
Loading
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: switching the XML loader to defusedxml to prevent XXE.
Description check ✅ Passed The description accurately matches the security-focused XMLLoader changes and added tests.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py (1)

42-58: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Catch defusedxml exceptions at the loader boundary. XMLLoader._parse_xml() only handles ParseError, so EntitiesForbidden/DTDForbidden/ExternalReferenceForbidden can escape through RAG.add() and direct XMLLoader.load() calls. Add a DefusedXmlException catch (or equivalent handling) to keep a bad XML document from aborting the whole add path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py` around lines 42
- 58, XMLLoader._parse_xml currently only catches ParseError, so defusedxml
safety exceptions can still escape and break RAG.add and XMLLoader.load. Update
_parse_xml in XMLLoader to also catch DefusedXmlException (or the specific
defusedxml exception base used by the loader) alongside ParseError, and fall
back to returning the raw content with parse_error metadata so malformed or
unsafe XML is contained at the loader boundary.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`:
- Around line 42-58: XMLLoader._parse_xml currently only catches ParseError, so
defusedxml safety exceptions can still escape and break RAG.add and
XMLLoader.load. Update _parse_xml in XMLLoader to also catch DefusedXmlException
(or the specific defusedxml exception base used by the loader) alongside
ParseError, and fall back to returning the raw content with parse_error metadata
so malformed or unsafe XML is contained at the loader boundary.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d128567f-9c24-49cd-a8e9-95cdbc5c3355

📥 Commits

Reviewing files that changed from the base of the PR and between 2b90117 and 6c4afa5.

📒 Files selected for processing (3)
  • lib/crewai-tools/pyproject.toml
  • lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
  • lib/crewai-tools/tests/rag/test_xml_loader_security.py

XMLLoader._parse_xml only caught xml.etree.ParseError, so any
DefusedXmlException (EntitiesForbidden / DTDForbidden /
ExternalReferenceForbidden) raised while parsing a hostile document
would escape the loader and abort the caller. That gives a single
malicious XML source the power to break a whole RAG.add() run.

Catch DefusedXmlException alongside ParseError and surface it through
metadata as 'security_error' (distinct from 'parse_error' so callers
can tell malformed XML from actively-refused XML). Content is emptied
in the security case since we deliberately did not evaluate any of it.

Tests updated to assert the new contract and a new
test_malformed_xml_still_returns_parse_error case guards the existing
ParseError path so the two error kinds do not get conflated.
@camgrimsec

Copy link
Copy Markdown
Author

Thanks @coderabbitai — good catch, this was a real gap.

Pushed d959e73 which catches defusedxml.common.DefusedXmlException alongside ParseError in _parse_xml. Rather than folding it into the existing parse_error branch, I surfaced it as a separate security_error key on metadata so downstream code can tell "the document was malformed" apart from "the document was actively refused by the security policy." Content is emptied in the security case since none of it was evaluated.

Also added test_malformed_xml_still_returns_parse_error to lock in that the ParseError path is not silently reclassified.

Diff for the follow-up:

        except DefusedXmlException as e:
            text = ""
            metadata = {"format": "xml", "security_error": str(e)}
        except ParseError as e:
            text = content
            metadata = {"format": "xml", "parse_error": str(e)}

Let me know if you'd rather see this collapsed under a single error key — happy to change the shape if it clashes with anything downstream.

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd xml_loader.py
echo "---"
fd test_xml_loader_security.py

Length of output: 374


🏁 Script executed:

#!/bin/bash
cat -n lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
echo "---TEST---"
cat -n lib/crewai-tools/tests/rag/test_xml_loader_security.py

Length of output: 7657


@camgrimsec Thanks for the quick follow-up — I checked d959e73 and it looks correct:

  • DefusedXmlException is imported from defusedxml.common, and since EntitiesForbidden (raised for both XXE and billion-laughs) is a subclass of it, the except DefusedXmlException block will catch both attack vectors.
  • The except-block ordering is safe: DefusedXmlException and ParseError are unrelated hierarchies, so there's no risk of one shadowing the other.
  • Emptying content in the security case makes sense — since the entity was never resolved, there's nothing meaningful to return, and returning it would risk giving a false sense of "safe" content back to callers.
  • test_malformed_xml_still_returns_parse_error is a good addition; it locks in that ordinary malformed XML doesn't get misclassified as a security issue.

On your question: I'd keep security_error and parse_error as separate keys rather than collapsing them into one error field. They represent semantically different failure modes ("actively refused for safety" vs. "just malformed"), and any existing code checking for parse_error specifically would silently break if it were renamed/merged. Your current shape is the right call — no changes needed from my side.

Nice work closing this gap.

[approve]

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py (1)

43-48: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Parse content unconditionally. The parse(source_ref) fallback can turn a URL input into an uncaught FileNotFoundError/OSError when the fetched payload doesn’t satisfy the < check (for example, a leading BOM). Parsing the already-loaded string keeps this inside the loader’s error handling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py` around lines 43
- 48, In _parse_xml, avoid falling back to parse(source_ref) based on the
leading-< check; always parse the already loaded content string so XML inputs
stay within the loader’s existing exception handling. Update the logic around
fromstring and remove the source_ref-based parse path to prevent uncaught
FileNotFoundError/OSError for valid payloads that don’t start with “<”.
🧹 Nitpick comments (1)
lib/crewai-tools/tests/rag/test_xml_loader_security.py (1)

18-24: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Docstring overstates stdlib XXE risk.

Per Python's own XML vulnerabilities documentation and defusedxml's docs, stdlib xml.etree.ElementTree does not resolve external/SYSTEM entities — it raises a ParseError instead of leaking file contents. The real value of this migration here is blocking internal entity expansion (billion laughs, covered by the next test) and providing defense-in-depth/consistent containment behavior, not fixing a stdlib file-exfiltration bug that doesn't exist in ElementTree.

✏️ Suggested docstring correction
-        The stdlib xml.etree parser would resolve `SYSTEM` entities (leaking
-        file contents). defusedxml refuses; XMLLoader captures that and
-        surfaces `security_error` in metadata.
+        defusedxml refuses to process any declared entity by default
+        (including this external `SYSTEM` reference); XMLLoader captures
+        that and surfaces `security_error` in metadata instead of raising.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/tests/rag/test_xml_loader_security.py` around lines 18 - 24,
Update the docstring in test_rejects_external_entity_expansion to remove the
claim that stdlib xml.etree resolves SYSTEM external entities and leaks file
contents. Keep the test intent aligned with XMLLoader’s security behavior by
describing that defusedxml rejects the external entity and that the loader
surfaces security_error in metadata, referencing the test method name and the
XMLLoader behavior it verifies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`:
- Around line 43-48: In _parse_xml, avoid falling back to parse(source_ref)
based on the leading-< check; always parse the already loaded content string so
XML inputs stay within the loader’s existing exception handling. Update the
logic around fromstring and remove the source_ref-based parse path to prevent
uncaught FileNotFoundError/OSError for valid payloads that don’t start with “<”.

---

Nitpick comments:
In `@lib/crewai-tools/tests/rag/test_xml_loader_security.py`:
- Around line 18-24: Update the docstring in
test_rejects_external_entity_expansion to remove the claim that stdlib xml.etree
resolves SYSTEM external entities and leaks file contents. Keep the test intent
aligned with XMLLoader’s security behavior by describing that defusedxml rejects
the external entity and that the loader surfaces security_error in metadata,
referencing the test method name and the XMLLoader behavior it verifies.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f5e6627-5332-4adf-a874-963ea145ddf6

📥 Commits

Reviewing files that changed from the base of the PR and between 6c4afa5 and d959e73.

📒 Files selected for processing (2)
  • lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
  • lib/crewai-tools/tests/rag/test_xml_loader_security.py

The old dual-path logic in _parse_xml fell back to parse(source_ref)
when content did not start with '<'. For any URL source that returned
a payload with a BOM prefix, leading whitespace, or an XML declaration
that failed the startswith check, source_ref got handed to parse() as
a filesystem path and raised an uncaught FileNotFoundError / OSError.
load_from_url already produced a string in memory, so there was no
reason to reach back through parse() at all.

Drop the fallback and always parse the already-loaded content string
so error handling stays inside the existing DefusedXmlException /
ParseError branches. Remove the now-unused parse import.

Adds test_url_payload_with_bom_is_parsed_not_treated_as_path to lock
in the BOM-prefixed case. Also tightens the docstring in
test_rejects_external_entity_expansion to describe what defusedxml
actually does (refuses declared entities) rather than overclaiming
that stdlib ElementTree leaks SYSTEM entity contents (it does not;
it raises ParseError).
@camgrimsec

Copy link
Copy Markdown
Author

Thanks @coderabbitai — both good calls.

Pushed 809b14d addressing the two items from this pass:

Major — always parse content as a string (43-48)

You're right that the parse(source_ref) fallback was a latent bug: for any URL payload where content.strip().startswith("<") returned False (BOM prefix, leading whitespace, XML declaration edge case), source_ref — a URL string — got handed to parse() as a filesystem path and would raise an uncaught FileNotFoundError. load_from_url already produces the payload in memory, so the fallback was never doing useful work anyway. Dropped it and the now-unused parse import.

Added test_url_payload_with_bom_is_parsed_not_treated_as_path covering the BOM case explicitly so the regression is guarded.

def _parse_xml(self, content: str, source_ref: str) -> LoaderResult:
    try:
        root = fromstring(content)
        ...

Trivial — docstring on test_rejects_external_entity_expansion

Also correct — stdlib xml.etree.ElementTree raises ParseError on external SYSTEM entities, not a silent file-content leak. Rewrote the docstring to describe what defusedxml actually does (refuses declared entities by default) rather than the overclaimed stdlib behavior.

Local check of all six tests still passes: XXE refusal, billion-laughs refusal, safe parse, malformed-XML parse_error, BOM-prefixed URL payload, and URL-sourced XXE.

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd xml_loader.py
echo "---"
fd test_xml_loader_security.py

Length of output: 374


🏁 Script executed:

#!/bin/bash
cat -n lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
echo "---TEST---"
cat -n lib/crewai-tools/tests/rag/test_xml_loader_security.py

Length of output: 8544


@camgrimsec Verified 809b14d — looks solid.

  • Confirmed _parse_xml now calls fromstring(content) unconditionally with no parse(source_ref) fallback, and the parse import is gone. That fully closes the BOM/leading-whitespace/URL-as-filesystem-path bug — there's no code path left that could hand a URL string to parse().
  • test_url_payload_with_bom_is_parsed_not_treated_as_path correctly exercises the scenario: mocks load_from_url, sends a BOM-prefixed payload, and asserts the content parses with no parse_error/security_error. Good regression coverage.
  • The docstring rewrite on test_rejects_external_entity_expansion is accurate now — describes defusedxml's default entity-refusal rather than overclaiming stdlib behavior.

Nice cleanup — this was a good catch on your end regarding the always-in-memory parsing. No further changes needed from my side.

[approve]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant