fix(bio-research): validate download hosts and URL-encode accessions in ncbi_utils (#166) by jpatel3 · Pull Request #304 · anthropics/knowledge-work-plugins

jpatel3 · 2026-06-08T16:58:18Z

Summary

Addresses the remaining defense-in-depth hardening from #166 in
bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py.

Issue #166 raised three concerns. Issue 1 (HTTP→HTTPS on the FASTQ download line) is already handled by #173, so it is intentionally not touched here. This PR covers the two that remain open:

Issue 2 — No domain validation on download URLs

Download URLs are built from the fastq_ftp field of the ENA API response and streamed to disk by download_file(). There was no check that they point at ENA/NCBI hosts.

Added a trusted-host allowlist (.ebi.ac.uk, .ncbi.nlm.nih.gov, .nih.gov).
download_file() now rejects any URL whose host isn't on the allowlist before creating directories or writing bytes.
Suffix-spoofing lookalikes (e.g. ftp.sra.ebi.ac.uk.evil.com) are rejected — the check matches on host boundaries, not substrings.

Issue 3 — Missing URL encoding on API parameters

User-supplied accessions were interpolated into Entrez/ENA URLs without escaping.

geo_id, study_accession, bioproject, and pmid are now passed through urllib.parse.quote(..., safe='') at every URL-construction site. Literal Entrez field syntax ([Accession], [GEO]) stays outside the encoded value, so queries still work.
NCBI-returned numeric UIDs are left as-is (not user input).

Tests

Added scripts/tests/test_ncbi_utils.py:

accepts known ENA/NCBI hosts, rejects untrusted + suffix-spoofed hosts
download_file() refuses an untrusted host and writes nothing
accession encoding neutralizes query-string metacharacters

$ python3 -m unittest tests.test_ncbi_utils -v
Ran 4 tests in 0.000s
OK

The test loads ncbi_utils directly via importlib so it runs without the package's optional yaml dependency.

Closes #166 (together with #173 for Issue 1).

…in ncbi_utils Hardens the GEO/SRA data fetcher (issue anthropics#166, defense-in-depth): - Add a trusted-host allowlist (.ebi.ac.uk / .ncbi.nlm.nih.gov / .nih.gov) and enforce it in download_file() before any bytes are written. Download URLs are built from the ENA API's fastq_ftp field; this prevents a tampered or compromised API response from steering the downloader to an arbitrary host. Suffix-spoofed lookalikes (e.g. ftp.sra.ebi.ac.uk.evil.com) are rejected. - URL-encode user-supplied accessions (geo_id, study_accession, bioproject, pmid) with urllib.parse.quote() at every Entrez/ENA URL construction site, so query-string metacharacters can't alter the request. - Add unit tests for the host allowlist, the download_file guard, and the accession encoding. Issue anthropics#166's HTTP->HTTPS change (line 343) is handled separately by anthropics#173 and is intentionally not touched here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bio-research): validate download hosts and URL-encode accessions in ncbi_utils (#166)#304

fix(bio-research): validate download hosts and URL-encode accessions in ncbi_utils (#166)#304
jpatel3 wants to merge 1 commit into
anthropics:mainfrom
jpatel3:fix/ncbi-download-host-allowlist-url-encoding

jpatel3 commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jpatel3 commented Jun 8, 2026

Summary

Issue 2 — No domain validation on download URLs

Issue 3 — Missing URL encoding on API parameters

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant