fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE#6446
fix(crewai-tools): use defusedxml in RAG XMLLoader to prevent XXE#6446camgrimsec wants to merge 4 commits into
Conversation
XMLLoader parses XML content sourced from user-supplied URLs and file paths via xml.etree.ElementTree, which allows external entity resolution and unbounded entity expansion. Documents containing DOCTYPE declarations can trigger XXE (file read / SSRF) or billion-laughs denial of service. Swap fromstring/parse over to defusedxml.ElementTree, which is API compatible and rejects entity declarations by default. Add defusedxml as a runtime dependency and drop the two noqa: S314 markers now that the underlying warning no longer applies. Adds tests/rag/test_xml_loader_security.py covering: - external-entity documents raise EntitiesForbidden - nested-entity (billion-laughs) documents raise EntitiesForbidden - benign XML still parses to the expected tree - malicious XML fetched from a URL is rejected
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds ChangesXXE Hardening
Sequence Diagram(s)sequenceDiagram
participant Test as TestXMLLoaderSecurity
participant Loader as XMLLoader
participant Defused as defusedxml.ElementTree
Test->>Loader: load(malicious XML)
Loader->>Defused: fromstring(content)
Defused-->>Loader: DefusedXmlException
Loader-->>Test: empty content + security_error metadata
Test->>Loader: load(safe XML)
Loader->>Defused: fromstring(content)
Defused-->>Loader: parsed root element
Loader-->>Test: content + format/root_tag metadata
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py (1)
42-58: 🩺 Stability & Availability | 🟡 Minor | ⚡ Quick winCatch defusedxml exceptions at the loader boundary.
XMLLoader._parse_xml()only handlesParseError, soEntitiesForbidden/DTDForbidden/ExternalReferenceForbiddencan escape throughRAG.add()and directXMLLoader.load()calls. Add aDefusedXmlExceptioncatch (or equivalent handling) to keep a bad XML document from aborting the whole add path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py` around lines 42 - 58, XMLLoader._parse_xml currently only catches ParseError, so defusedxml safety exceptions can still escape and break RAG.add and XMLLoader.load. Update _parse_xml in XMLLoader to also catch DefusedXmlException (or the specific defusedxml exception base used by the loader) alongside ParseError, and fall back to returning the raw content with parse_error metadata so malformed or unsafe XML is contained at the loader boundary.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`:
- Around line 42-58: XMLLoader._parse_xml currently only catches ParseError, so
defusedxml safety exceptions can still escape and break RAG.add and
XMLLoader.load. Update _parse_xml in XMLLoader to also catch DefusedXmlException
(or the specific defusedxml exception base used by the loader) alongside
ParseError, and fall back to returning the raw content with parse_error metadata
so malformed or unsafe XML is contained at the loader boundary.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: d128567f-9c24-49cd-a8e9-95cdbc5c3355
📒 Files selected for processing (3)
lib/crewai-tools/pyproject.tomllib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.pylib/crewai-tools/tests/rag/test_xml_loader_security.py
XMLLoader._parse_xml only caught xml.etree.ParseError, so any DefusedXmlException (EntitiesForbidden / DTDForbidden / ExternalReferenceForbidden) raised while parsing a hostile document would escape the loader and abort the caller. That gives a single malicious XML source the power to break a whole RAG.add() run. Catch DefusedXmlException alongside ParseError and surface it through metadata as 'security_error' (distinct from 'parse_error' so callers can tell malformed XML from actively-refused XML). Content is emptied in the security case since we deliberately did not evaluate any of it. Tests updated to assert the new contract and a new test_malformed_xml_still_returns_parse_error case guards the existing ParseError path so the two error kinds do not get conflated.
|
Thanks @coderabbitai — good catch, this was a real gap. Pushed Also added Diff for the follow-up: except DefusedXmlException as e:
text = ""
metadata = {"format": "xml", "security_error": str(e)}
except ParseError as e:
text = content
metadata = {"format": "xml", "parse_error": str(e)}Let me know if you'd rather see this collapsed under a single |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
fd xml_loader.py
echo "---"
fd test_xml_loader_security.pyLength of output: 374 🏁 Script executed: #!/bin/bash
cat -n lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
echo "---TEST---"
cat -n lib/crewai-tools/tests/rag/test_xml_loader_security.pyLength of output: 7657
On your question: I'd keep Nice work closing this gap. [approve] |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py (1)
43-48: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftParse
contentunconditionally. Theparse(source_ref)fallback can turn a URL input into an uncaughtFileNotFoundError/OSErrorwhen the fetched payload doesn’t satisfy the<check (for example, a leading BOM). Parsing the already-loaded string keeps this inside the loader’s error handling.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py` around lines 43 - 48, In _parse_xml, avoid falling back to parse(source_ref) based on the leading-< check; always parse the already loaded content string so XML inputs stay within the loader’s existing exception handling. Update the logic around fromstring and remove the source_ref-based parse path to prevent uncaught FileNotFoundError/OSError for valid payloads that don’t start with “<”.
🧹 Nitpick comments (1)
lib/crewai-tools/tests/rag/test_xml_loader_security.py (1)
18-24: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDocstring overstates stdlib XXE risk.
Per Python's own XML vulnerabilities documentation and defusedxml's docs, stdlib
xml.etree.ElementTreedoes not resolve external/SYSTEMentities — it raises aParseErrorinstead of leaking file contents. The real value of this migration here is blocking internal entity expansion (billion laughs, covered by the next test) and providing defense-in-depth/consistent containment behavior, not fixing a stdlib file-exfiltration bug that doesn't exist inElementTree.✏️ Suggested docstring correction
- The stdlib xml.etree parser would resolve `SYSTEM` entities (leaking - file contents). defusedxml refuses; XMLLoader captures that and - surfaces `security_error` in metadata. + defusedxml refuses to process any declared entity by default + (including this external `SYSTEM` reference); XMLLoader captures + that and surfaces `security_error` in metadata instead of raising.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/crewai-tools/tests/rag/test_xml_loader_security.py` around lines 18 - 24, Update the docstring in test_rejects_external_entity_expansion to remove the claim that stdlib xml.etree resolves SYSTEM external entities and leaks file contents. Keep the test intent aligned with XMLLoader’s security behavior by describing that defusedxml rejects the external entity and that the loader surfaces security_error in metadata, referencing the test method name and the XMLLoader behavior it verifies.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py`:
- Around line 43-48: In _parse_xml, avoid falling back to parse(source_ref)
based on the leading-< check; always parse the already loaded content string so
XML inputs stay within the loader’s existing exception handling. Update the
logic around fromstring and remove the source_ref-based parse path to prevent
uncaught FileNotFoundError/OSError for valid payloads that don’t start with “<”.
---
Nitpick comments:
In `@lib/crewai-tools/tests/rag/test_xml_loader_security.py`:
- Around line 18-24: Update the docstring in
test_rejects_external_entity_expansion to remove the claim that stdlib xml.etree
resolves SYSTEM external entities and leaks file contents. Keep the test intent
aligned with XMLLoader’s security behavior by describing that defusedxml rejects
the external entity and that the loader surfaces security_error in metadata,
referencing the test method name and the XMLLoader behavior it verifies.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 9f5e6627-5332-4adf-a874-963ea145ddf6
📒 Files selected for processing (2)
lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.pylib/crewai-tools/tests/rag/test_xml_loader_security.py
The old dual-path logic in _parse_xml fell back to parse(source_ref) when content did not start with '<'. For any URL source that returned a payload with a BOM prefix, leading whitespace, or an XML declaration that failed the startswith check, source_ref got handed to parse() as a filesystem path and raised an uncaught FileNotFoundError / OSError. load_from_url already produced a string in memory, so there was no reason to reach back through parse() at all. Drop the fallback and always parse the already-loaded content string so error handling stays inside the existing DefusedXmlException / ParseError branches. Remove the now-unused parse import. Adds test_url_payload_with_bom_is_parsed_not_treated_as_path to lock in the BOM-prefixed case. Also tightens the docstring in test_rejects_external_entity_expansion to describe what defusedxml actually does (refuses declared entities) rather than overclaiming that stdlib ElementTree leaks SYSTEM entity contents (it does not; it raises ParseError).
|
Thanks @coderabbitai — both good calls. Pushed Major — always parse You're right that the Added def _parse_xml(self, content: str, source_ref: str) -> LoaderResult:
try:
root = fromstring(content)
...Trivial — docstring on Also correct — stdlib Local check of all six tests still passes: XXE refusal, billion-laughs refusal, safe parse, malformed-XML |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
fd xml_loader.py
echo "---"
fd test_xml_loader_security.pyLength of output: 374 🏁 Script executed: #!/bin/bash
cat -n lib/crewai-tools/src/crewai_tools/rag/loaders/xml_loader.py
echo "---TEST---"
cat -n lib/crewai-tools/tests/rag/test_xml_loader_security.pyLength of output: 8544
Nice cleanup — this was a good catch on your end regarding the always-in-memory parsing. No further changes needed from my side. [approve] |
XMLLoader parses XML content sourced from user-supplied URLs and file paths via xml.etree.ElementTree, which allows external entity resolution and unbounded entity expansion. Documents containing DOCTYPE declarations can trigger XXE (file read / SSRF) or billion-laughs denial of service.
Swap fromstring/parse over to defusedxml.ElementTree, which is API compatible and rejects entity declarations by default. Add defusedxml as a runtime dependency and drop the two noqa: S314 markers now that the underlying warning no longer applies.
Adds tests/rag/test_xml_loader_security.py covering: