Skip to content

fix(bio-research): validate download hosts and URL-encode accessions in ncbi_utils (#166)#304

Open
jpatel3 wants to merge 1 commit into
anthropics:mainfrom
jpatel3:fix/ncbi-download-host-allowlist-url-encoding
Open

fix(bio-research): validate download hosts and URL-encode accessions in ncbi_utils (#166)#304
jpatel3 wants to merge 1 commit into
anthropics:mainfrom
jpatel3:fix/ncbi-download-host-allowlist-url-encoding

Conversation

@jpatel3

@jpatel3 jpatel3 commented Jun 8, 2026

Copy link
Copy Markdown

Summary

Addresses the remaining defense-in-depth hardening from #166 in
bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py.

Issue #166 raised three concerns. Issue 1 (HTTP→HTTPS on the FASTQ download line) is already handled by #173, so it is intentionally not touched here. This PR covers the two that remain open:

Issue 2 — No domain validation on download URLs

Download URLs are built from the fastq_ftp field of the ENA API response and streamed to disk by download_file(). There was no check that they point at ENA/NCBI hosts.

  • Added a trusted-host allowlist (.ebi.ac.uk, .ncbi.nlm.nih.gov, .nih.gov).
  • download_file() now rejects any URL whose host isn't on the allowlist before creating directories or writing bytes.
  • Suffix-spoofing lookalikes (e.g. ftp.sra.ebi.ac.uk.evil.com) are rejected — the check matches on host boundaries, not substrings.

Issue 3 — Missing URL encoding on API parameters

User-supplied accessions were interpolated into Entrez/ENA URLs without escaping.

  • geo_id, study_accession, bioproject, and pmid are now passed through urllib.parse.quote(..., safe='') at every URL-construction site. Literal Entrez field syntax ([Accession], [GEO]) stays outside the encoded value, so queries still work.
  • NCBI-returned numeric UIDs are left as-is (not user input).

Tests

Added scripts/tests/test_ncbi_utils.py:

  • accepts known ENA/NCBI hosts, rejects untrusted + suffix-spoofed hosts
  • download_file() refuses an untrusted host and writes nothing
  • accession encoding neutralizes query-string metacharacters
$ python3 -m unittest tests.test_ncbi_utils -v
Ran 4 tests in 0.000s
OK

The test loads ncbi_utils directly via importlib so it runs without the package's optional yaml dependency.

Closes #166 (together with #173 for Issue 1).

…in ncbi_utils

Hardens the GEO/SRA data fetcher (issue anthropics#166, defense-in-depth):

- Add a trusted-host allowlist (.ebi.ac.uk / .ncbi.nlm.nih.gov / .nih.gov)
  and enforce it in download_file() before any bytes are written. Download
  URLs are built from the ENA API's fastq_ftp field; this prevents a
  tampered or compromised API response from steering the downloader to an
  arbitrary host. Suffix-spoofed lookalikes (e.g. ftp.sra.ebi.ac.uk.evil.com)
  are rejected.
- URL-encode user-supplied accessions (geo_id, study_accession, bioproject,
  pmid) with urllib.parse.quote() at every Entrez/ENA URL construction site,
  so query-string metacharacters can't alter the request.
- Add unit tests for the host allowlist, the download_file guard, and the
  accession encoding.

Issue anthropics#166's HTTP->HTTPS change (line 343) is handled separately by anthropics#173 and
is intentionally not touched here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] SSRF + path traversal chain in bio-research ncbi_utils.py and sra_geo_fetch.py

1 participant