fix(bio-research): validate download hosts and URL-encode accessions in ncbi_utils (#166)#304
Open
jpatel3 wants to merge 1 commit into
Open
Conversation
…in ncbi_utils Hardens the GEO/SRA data fetcher (issue anthropics#166, defense-in-depth): - Add a trusted-host allowlist (.ebi.ac.uk / .ncbi.nlm.nih.gov / .nih.gov) and enforce it in download_file() before any bytes are written. Download URLs are built from the ENA API's fastq_ftp field; this prevents a tampered or compromised API response from steering the downloader to an arbitrary host. Suffix-spoofed lookalikes (e.g. ftp.sra.ebi.ac.uk.evil.com) are rejected. - URL-encode user-supplied accessions (geo_id, study_accession, bioproject, pmid) with urllib.parse.quote() at every Entrez/ENA URL construction site, so query-string metacharacters can't alter the request. - Add unit tests for the host allowlist, the download_file guard, and the accession encoding. Issue anthropics#166's HTTP->HTTPS change (line 343) is handled separately by anthropics#173 and is intentionally not touched here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses the remaining defense-in-depth hardening from #166 in
bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py.Issue #166 raised three concerns. Issue 1 (HTTP→HTTPS on the FASTQ download line) is already handled by #173, so it is intentionally not touched here. This PR covers the two that remain open:
Issue 2 — No domain validation on download URLs
Download URLs are built from the
fastq_ftpfield of the ENA API response and streamed to disk bydownload_file(). There was no check that they point at ENA/NCBI hosts..ebi.ac.uk,.ncbi.nlm.nih.gov,.nih.gov).download_file()now rejects any URL whose host isn't on the allowlist before creating directories or writing bytes.ftp.sra.ebi.ac.uk.evil.com) are rejected — the check matches on host boundaries, not substrings.Issue 3 — Missing URL encoding on API parameters
User-supplied accessions were interpolated into Entrez/ENA URLs without escaping.
geo_id,study_accession,bioproject, andpmidare now passed throughurllib.parse.quote(..., safe='')at every URL-construction site. Literal Entrez field syntax ([Accession],[GEO]) stays outside the encoded value, so queries still work.Tests
Added
scripts/tests/test_ncbi_utils.py:download_file()refuses an untrusted host and writes nothingThe test loads
ncbi_utilsdirectly viaimportlibso it runs without the package's optionalyamldependency.Closes #166 (together with #173 for Issue 1).