fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE by camgrimsec · Pull Request #6446 · crewAIInc/crewAI

camgrimsec · 2026-07-03T04:30:29Z

XMLLoader parses XML content sourced from user-supplied URLs and file paths via xml.etree.ElementTree, which allows external entity resolution and unbounded entity expansion. Documents containing DOCTYPE declarations can trigger XXE (file read / SSRF) or billion-laughs denial of service.

Swap fromstring/parse over to defusedxml.ElementTree, which is API compatible and rejects entity declarations by default. Add defusedxml as a runtime dependency and drop the two noqa: S314 markers now that the underlying warning no longer applies.

Adds tests/rag/test_xml_loader_security.py covering:

external-entity documents raise EntitiesForbidden
nested-entity (billion-laughs) documents raise EntitiesForbidden
benign XML still parses to the expected tree
malicious XML fetched from a URL is rejected

XMLLoader parses XML content sourced from user-supplied URLs and file paths via xml.etree.ElementTree, which allows external entity resolution and unbounded entity expansion. Documents containing DOCTYPE declarations can trigger XXE (file read / SSRF) or billion-laughs denial of service. Swap fromstring/parse over to defusedxml.ElementTree, which is API compatible and rejects entity declarations by default. Add defusedxml as a runtime dependency and drop the two noqa: S314 markers now that the underlying warning no longer applies. Adds tests/rag/test_xml_loader_security.py covering: - external-entity documents raise EntitiesForbidden - nested-entity (billion-laughs) documents raise EntitiesForbidden - benign XML still parses to the expected tree - malicious XML fetched from a URL is rejected

coderabbitai · 2026-07-03T04:31:09Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: b4c2c7e3-88cb-4f01-95b7-d5b7e8e9681d

📥 Commits

Reviewing files that changed from the base of the PR and between d959e73 and 809b14d.

📒 Files selected for processing (2)

lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
lib/crewai-tools/tests/rag/test_xml_loader_security.py

🚧 Files skipped from review as they are similar to previous changes (1)

lib/crewai-tools/tests/rag/test_xml_loader_security.py

📝 Walkthrough

Walkthrough

This PR adds defusedxml to lib/crewai-tools, switches XMLLoader to defusedxml-based parsing and security handling, and adds regression tests for malicious, malformed, and safe XML inputs.

Changes

XXE Hardening

Layer / File(s)	Summary
Dependency and parser changes `lib/crewai-tools/pyproject.toml`, `lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`	Adds `defusedxml>=0.7.1` and updates `XMLLoader` to parse XML with `defusedxml.ElementTree.fromstring`, handling `DefusedXmlException` with empty content and `security_error` metadata.
Security regression tests `lib/crewai-tools/tests/rag/test_xml_loader_security.py`	Adds coverage for XXE, entity expansion, safe XML, malformed XML, BOM-prefixed URL payloads, and malicious XML returned from a URL source.

Sequence Diagram(s)

sequenceDiagram
  participant Test as TestXMLLoaderSecurity
  participant Loader as XMLLoader
  participant Defused as defusedxml.ElementTree

  Test->>Loader: load(malicious XML)
  Loader->>Defused: fromstring(content)
  Defused-->>Loader: DefusedXmlException
  Loader-->>Test: empty content + security_error metadata

  Test->>Loader: load(safe XML)
  Loader->>Defused: fromstring(content)
  Defused-->>Loader: parsed root element
  Loader-->>Test: content + format/root_tag metadata

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: switching the XML loader to defusedxml to prevent XXE.
Description check	✅ Passed	The description accurately matches the security-focused XMLLoader changes and added tests.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py (1)
42-58: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Catch defusedxml exceptions at the loader boundary. XMLLoader._parse_xml() only handles ParseError, so EntitiesForbidden/DTDForbidden/ExternalReferenceForbidden can escape through RAG.add() and direct XMLLoader.load() calls. Add a DefusedXmlException catch (or equivalent handling) to keep a bad XML document from aborting the whole add path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py` around lines 42
- 58, XMLLoader._parse_xml currently only catches ParseError, so defusedxml
safety exceptions can still escape and break RAG.add and XMLLoader.load. Update
_parse_xml in XMLLoader to also catch DefusedXmlException (or the specific
defusedxml exception base used by the loader) alongside ParseError, and fall
back to returning the raw content with parse_error metadata so malformed or
unsafe XML is contained at the loader boundary.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`:
- Around line 42-58: XMLLoader._parse_xml currently only catches ParseError, so
defusedxml safety exceptions can still escape and break RAG.add and
XMLLoader.load. Update _parse_xml in XMLLoader to also catch DefusedXmlException
(or the specific defusedxml exception base used by the loader) alongside
ParseError, and fall back to returning the raw content with parse_error metadata
so malformed or unsafe XML is contained at the loader boundary.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d128567f-9c24-49cd-a8e9-95cdbc5c3355

📥 Commits

Reviewing files that changed from the base of the PR and between 2b90117 and 6c4afa5.

📒 Files selected for processing (3)

lib/crewai-tools/pyproject.toml
lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
lib/crewai-tools/tests/rag/test_xml_loader_security.py

XMLLoader._parse_xml only caught xml.etree.ParseError, so any DefusedXmlException (EntitiesForbidden / DTDForbidden / ExternalReferenceForbidden) raised while parsing a hostile document would escape the loader and abort the caller. That gives a single malicious XML source the power to break a whole RAG.add() run. Catch DefusedXmlException alongside ParseError and surface it through metadata as 'security_error' (distinct from 'parse_error' so callers can tell malformed XML from actively-refused XML). Content is emptied in the security case since we deliberately did not evaluate any of it. Tests updated to assert the new contract and a new test_malformed_xml_still_returns_parse_error case guards the existing ParseError path so the two error kinds do not get conflated.

camgrimsec · 2026-07-03T04:55:22Z

Thanks @coderabbitai — good catch, this was a real gap.

Pushed d959e73 which catches defusedxml.common.DefusedXmlException alongside ParseError in _parse_xml. Rather than folding it into the existing parse_error branch, I surfaced it as a separate security_error key on metadata so downstream code can tell "the document was malformed" apart from "the document was actively refused by the security policy." Content is emptied in the security case since none of it was evaluated.

Also added test_malformed_xml_still_returns_parse_error to lock in that the ParseError path is not silently reclassified.

Diff for the follow-up:

        except DefusedXmlException as e:
            text = ""
            metadata = {"format": "xml", "security_error": str(e)}
        except ParseError as e:
            text = content
            metadata = {"format": "xml", "parse_error": str(e)}

Let me know if you'd rather see this collapsed under a single error key — happy to change the shape if it clashes with anything downstream.

coderabbitai · 2026-07-03T04:56:51Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd xml_loader.py
echo "---"
fd test_xml_loader_security.py

Length of output: 374

🏁 Script executed:

#!/bin/bash
cat -n lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
echo "---TEST---"
cat -n lib/crewai-tools/tests/rag/test_xml_loader_security.py

Length of output: 7657

@camgrimsec Thanks for the quick follow-up — I checked d959e73 and it looks correct:

DefusedXmlException is imported from defusedxml.common, and since EntitiesForbidden (raised for both XXE and billion-laughs) is a subclass of it, the except DefusedXmlException block will catch both attack vectors.
The except-block ordering is safe: DefusedXmlException and ParseError are unrelated hierarchies, so there's no risk of one shadowing the other.
Emptying content in the security case makes sense — since the entity was never resolved, there's nothing meaningful to return, and returning it would risk giving a false sense of "safe" content back to callers.
test_malformed_xml_still_returns_parse_error is a good addition; it locks in that ordinary malformed XML doesn't get misclassified as a security issue.

On your question: I'd keep security_error and parse_error as separate keys rather than collapsing them into one error field. They represent semantically different failure modes ("actively refused for safety" vs. "just malformed"), and any existing code checking for parse_error specifically would silently break if it were renamed/merged. Your current shape is the right call — no changes needed from my side.

Nice work closing this gap.

[approve]

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py (1)
43-48: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Parse content unconditionally. The parse(source_ref) fallback can turn a URL input into an uncaught FileNotFoundError/OSError when the fetched payload doesn’t satisfy the < check (for example, a leading BOM). Parsing the already-loaded string keeps this inside the loader’s error handling.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py` around lines 43
- 48, In _parse_xml, avoid falling back to parse(source_ref) based on the
leading-< check; always parse the already loaded content string so XML inputs
stay within the loader’s existing exception handling. Update the logic around
fromstring and remove the source_ref-based parse path to prevent uncaught
FileNotFoundError/OSError for valid payloads that don’t start with “<”.

🧹 Nitpick comments (1)

lib/crewai-tools/tests/rag/test_xml_loader_security.py (1)
18-24: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Docstring overstates stdlib XXE risk.

Per Python's own XML vulnerabilities documentation and defusedxml's docs, stdlib xml.etree.ElementTree does not resolve external/SYSTEM entities — it raises a ParseError instead of leaking file contents. The real value of this migration here is blocking internal entity expansion (billion laughs, covered by the next test) and providing defense-in-depth/consistent containment behavior, not fixing a stdlib file-exfiltration bug that doesn't exist in ElementTree.
✏️ Suggested docstring correction
-        The stdlib xml.etree parser would resolve `SYSTEM` entities (leaking
-        file contents). defusedxml refuses; XMLLoader captures that and
-        surfaces `security_error` in metadata.
+        defusedxml refuses to process any declared entity by default
+        (including this external `SYSTEM` reference); XMLLoader captures
+        that and surfaces `security_error` in metadata instead of raising.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai-tools/tests/rag/test_xml_loader_security.py` around lines 18 - 24,
Update the docstring in test_rejects_external_entity_expansion to remove the
claim that stdlib xml.etree resolves SYSTEM external entities and leaks file
contents. Keep the test intent aligned with XMLLoader’s security behavior by
describing that defusedxml rejects the external entity and that the loader
surfaces security_error in metadata, referencing the test method name and the
XMLLoader behavior it verifies.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`:
- Around line 43-48: In _parse_xml, avoid falling back to parse(source_ref)
based on the leading-< check; always parse the already loaded content string so
XML inputs stay within the loader’s existing exception handling. Update the
logic around fromstring and remove the source_ref-based parse path to prevent
uncaught FileNotFoundError/OSError for valid payloads that don’t start with “<”.

---

Nitpick comments:
In `@lib/crewai-tools/tests/rag/test_xml_loader_security.py`:
- Around line 18-24: Update the docstring in
test_rejects_external_entity_expansion to remove the claim that stdlib xml.etree
resolves SYSTEM external entities and leaks file contents. Keep the test intent
aligned with XMLLoader’s security behavior by describing that defusedxml rejects
the external entity and that the loader surfaces security_error in metadata,
referencing the test method name and the XMLLoader behavior it verifies.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f5e6627-5332-4adf-a874-963ea145ddf6

📥 Commits

Reviewing files that changed from the base of the PR and between 6c4afa5 and d959e73.

📒 Files selected for processing (2)

lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
lib/crewai-tools/tests/rag/test_xml_loader_security.py

The old dual-path logic in _parse_xml fell back to parse(source_ref) when content did not start with '<'. For any URL source that returned a payload with a BOM prefix, leading whitespace, or an XML declaration that failed the startswith check, source_ref got handed to parse() as a filesystem path and raised an uncaught FileNotFoundError / OSError. load_from_url already produced a string in memory, so there was no reason to reach back through parse() at all. Drop the fallback and always parse the already-loaded content string so error handling stays inside the existing DefusedXmlException / ParseError branches. Remove the now-unused parse import. Adds test_url_payload_with_bom_is_parsed_not_treated_as_path to lock in the BOM-prefixed case. Also tightens the docstring in test_rejects_external_entity_expansion to describe what defusedxml actually does (refuses declared entities) rather than overclaiming that stdlib ElementTree leaks SYSTEM entity contents (it does not; it raises ParseError).

camgrimsec · 2026-07-03T05:10:32Z

Thanks @coderabbitai — both good calls.

Pushed 809b14d addressing the two items from this pass:

Major — always parse content as a string (43-48)

You're right that the parse(source_ref) fallback was a latent bug: for any URL payload where content.strip().startswith("<") returned False (BOM prefix, leading whitespace, XML declaration edge case), source_ref — a URL string — got handed to parse() as a filesystem path and would raise an uncaught FileNotFoundError. load_from_url already produces the payload in memory, so the fallback was never doing useful work anyway. Dropped it and the now-unused parse import.

Added test_url_payload_with_bom_is_parsed_not_treated_as_path covering the BOM case explicitly so the regression is guarded.

def _parse_xml(self, content: str, source_ref: str) -> LoaderResult:
    try:
        root = fromstring(content)
        ...

Trivial — docstring on test_rejects_external_entity_expansion

Also correct — stdlib xml.etree.ElementTree raises ParseError on external SYSTEM entities, not a silent file-content leak. Rewrote the docstring to describe what defusedxml actually does (refuses declared entities by default) rather than the overclaimed stdlib behavior.

Local check of all six tests still passes: XXE refusal, billion-laughs refusal, safe parse, malformed-XML parse_error, BOM-prefixed URL payload, and URL-sourced XXE.

coderabbitai · 2026-07-03T05:11:42Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd xml_loader.py
echo "---"
fd test_xml_loader_security.py

Length of output: 374

🏁 Script executed:

#!/bin/bash
cat -n lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
echo "---TEST---"
cat -n lib/crewai-tools/tests/rag/test_xml_loader_security.py

Length of output: 8544

@camgrimsec Verified 809b14d — looks solid.

Confirmed _parse_xml now calls fromstring(content) unconditionally with no parse(source_ref) fallback, and the parse import is gone. That fully closes the BOM/leading-whitespace/URL-as-filesystem-path bug — there's no code path left that could hand a URL string to parse().
test_url_payload_with_bom_is_parsed_not_treated_as_path correctly exercises the scenario: mocks load_from_url, sends a BOM-prefixed payload, and asserts the content parses with no parse_error/security_error. Good regression coverage.
The docstring rewrite on test_rejects_external_entity_expansion is accurate now — describes defusedxml's default entity-refusal rather than overclaiming stdlib behavior.

Nice cleanup — this was a good catch on your end regarding the always-in-memory parsing. No further changes needed from my side.

[approve]

Merge branch 'main' into fix-xml-loader-xxe

6c4afa5

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE#6446

fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE#6446
camgrimsec wants to merge 4 commits into
crewAIInc:mainfrom
camgrimsec:fix-xml-loader-xxe

camgrimsec commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

camgrimsec commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

camgrimsec commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

camgrimsec commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

camgrimsec commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

camgrimsec commented Jul 3, 2026

Uh oh!

coderabbitai Bot commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jul 3, 2026 •

edited

Loading