CSCfi/sd-search-api

SD Search API

Description

The SD Search API enables search across different datasets.

Supported configurations:

  • Bigpicture image search

Dependencies

  • PostgreSQL: database for search metadata
  • OpenSearch: search indexes built from the search metadata
  • Snowstorm: SNOMED CT ontology server

OpenSearch

OpenSearch indexes:

  • Bigpicture: bp-image-index.json

Development

Setup

Install uv, then create the virtualenv and install all dependencies:

uv sync --dev

Activate the pre-commit hook to run tox before every commit:

uv run pre-commit install

Formatting and linting

tox -e ruff
tox -e mypy

Unit tests

tox -e pytest

Integration tests

Integration tests require Postgres and OpenSearch to be running. Start them with Docker Compose:

docker compose --env-file tests/integration/.env --profile dev up --build

Then run:

uv run pytest tests/integration/

Environment variables are defined in tests/integration/.env.

External dependencies

Snowstorm

Snowstorm is a SNOMED CT terminology server used by the SD Search API to resolve SNOMED CT terms to concepts.

  • A Snowstorm instance is available at https://snowstorm.rahtiapp.fi.
  • A SNOMED browser instance is available at https://snomed-browser.rahtiapp.fi/.

Data import

This is only needed when importing a new SNOMED CT release into the shared instance. The full procedure is described in https://github.com/IHTSDO/snowstorm/blob/master/docs/loading-snomed.md.

First check that the Snowstorm service is healthy:

curl https://snowstorm.rahtiapp.fi/actuator/health

Expected output:

{"status":"UP","groups":["liveness","readiness"]}
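For scripted imports, the same check can be made from Python. This is a minimal sketch, assuming the health endpoint returns the JSON shape shown above; `is_up` and `fetch_health` are hypothetical helper names, not part of this repository.

```python
import json
import urllib.request

HEALTH_URL = "https://snowstorm.rahtiapp.fi/actuator/health"


def is_up(payload: dict) -> bool:
    """Return True when the health payload reports status UP."""
    return payload.get("status") == "UP"


def fetch_health(url: str = HEALTH_URL, timeout: float = 10.0) -> dict:
    """Fetch and decode the health JSON (performs a network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `is_up(fetch_health())` before creating the import job gives an early failure if the service is down.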

Create import job

curl -i --location 'https://snowstorm.rahtiapp.fi/imports' \
  --header 'Content-Type: application/json' \
  --data '{"type":"SNAPSHOT","branchPath":"MAIN","createCodeSystemVersion":true}'

Example output:

HTTP/1.1 201 
location: https://snowstorm.rahtiapp.fi/imports/<ID>

Copy the import ID (e.g. f0801e81-3740-48bd-bc3e-848c7aa7468e) from the location header of the response and set the IMPORT_ID environment variable:

export IMPORT_ID=<ID>
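If the import is driven from a script, the ID can be taken from the location header programmatically. The sketch below assumes the ID is the last path segment of the header value, as in the example output; `import_id_from_location` is a hypothetical helper.

```python
from urllib.parse import urlparse


def import_id_from_location(location: str) -> str:
    """Return the last path segment of a location header value."""
    path = urlparse(location).path.rstrip("/")
    import_id = path.rsplit("/", 1)[-1]
    if not import_id:
        raise ValueError(f"no import ID in location: {location!r}")
    return import_id
```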

Import SNOMED release

Upload the SNOMED release file (e.g. SnomedCT_InternationalRF2_PRODUCTION_20260601T120000Z.zip):

curl --location -X POST "https://snowstorm.rahtiapp.fi/imports/${IMPORT_ID}/archive" \
  -F "file=@<SNOMED release file>"

The upload and import can take several hours. Poll the import status until status is COMPLETED or until the import job is no longer available:

curl --location "https://snowstorm.rahtiapp.fi/imports/${IMPORT_ID}"

Example output while running:

{
  "status" : "RUNNING",
  "type" : "SNAPSHOT",
  "branchPath" : "MAIN",
  "internalRelease" : false,
  "moduleIds" : [ ],
  "createCodeSystemVersion" : true
}
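Because the import can run for hours, polling is easier to leave to a small loop. This is a sketch only: `wait_for_import` is a hypothetical helper, and `fetch_status` is a caller-supplied function that is assumed to return the job's "status" string, or None once the import job is no longer available.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_import(
    fetch_status: Callable[[], Optional[str]],
    terminal: Iterable[str] = ("COMPLETED",),
    poll_seconds: float = 60.0,
    max_polls: int = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until a terminal status is seen or the job disappears (None)."""
    done = set(terminal)
    for _ in range(max_polls):
        status = fetch_status()
        if status is None or status in done:
            return status
        # Not finished yet: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError("import did not finish within the polling budget")
```

The `sleep` and `fetch_status` parameters are injected so the loop can be exercised without a network or real delays. Other statuses that should stop the loop can be passed through `terminal`.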

You can also monitor the import progress from the logs:

oc logs -f deployment/snowstorm

Once finished, verify that the import has been completed.

Check the imported versions:

curl -s https://snowstorm.rahtiapp.fi/codesystems/SNOMEDCT/versions | jq '.items[] | {version, branchPath}'

Example output:

{
  "version": "2026-06-01",
  "branchPath": "MAIN/2026-06-01"
}
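The same version check can be expressed in Python. The sketch assumes the response has an "items" list whose entries carry "version" and "branchPath", as the jq filter above implies; `released_versions` is a hypothetical helper.

```python
def released_versions(payload: dict) -> dict:
    """Map each version to its branch path from a versions payload."""
    return {
        item["version"]: item["branchPath"]
        for item in payload.get("items", [])
    }
```

A script can then check that the expected release date is among the keys of the result before continuing.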

Check the MAIN branch:

curl -s https://snowstorm.rahtiapp.fi/branches/MAIN

Example output:

{
  "path" : "MAIN",
  "state" : "UP_TO_DATE",
  "containsContent" : true,
  "locked" : false,
  "creation" : "2026-06-11T05:12:34.688Z",
  "base" : "2026-06-11T05:12:34.688Z",
  "head" : "2026-06-11T05:52:38.457Z",
  "creationTimestamp" : 1781154754688,
  "baseTimestamp" : 1781154754688,
  "headTimestamp" : 1781157158457,
  ...
}

Get number of concepts:

curl -s "https://snowstorm.rahtiapp.fi/MAIN/concepts?limit=1&active=true" | jq '{total}'

Example output:

{
  "total": 532824
}

Get a concept:

curl -s "https://snowstorm.rahtiapp.fi/MAIN/concepts/337915000" | jq '{conceptId, active, fsn: .fsn.term}'

Example output:

{
  "conceptId": "337915000",
  "active": true,
  "fsn": "Homo sapiens (organism)"
}

Data loading

Bigpicture

Load datasets

Load a single dataset directory (default):

uv run python scripts/admin.py Bigpicture load /path/to/dataset/ --load

Load from a parent directory containing multiple dataset subdirectories:

uv run python scripts/admin.py Bigpicture load /path/to/datasets/ --multi-dir --load

Omit --load to parse the XML files without loading them into the database.

To also sync to OpenSearch immediately after loading, add --sync:

uv run python scripts/admin.py Bigpicture load /path/to/datasets/ --multi-dir --load --sync

Refresh SNOMED CT preferred terms

After a new SNOMED CT release, update the stored preferred terms to match the new release. The preferred-terms cache is shared across deployments, so this command is not tied to a specific one:

uv run python scripts/admin.py snomed refresh

Generate the OpenSearch index

The OpenSearch index mapping (search_api/opensearch/bigpicture/bp-image-index.json) is generated from the filtered and non-filtered field definitions, so that field names and types stay in sync with them. After changing the field definitions, regenerate and commit the file:

uv run python scripts/admin.py Bigpicture generate-index

A unit test fails if this file differs from a freshly generated one.
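The idea behind such a drift check can be sketched as follows. This is not the repository's actual test: `mapping_is_current` is a hypothetical helper, and the sketch assumes the generator can hand back the mapping as a dict. It compares parsed JSON, so whitespace and key order are ignored, which may be looser than the real test.

```python
import json


def mapping_is_current(committed_text: str, generated: dict) -> bool:
    """Compare committed JSON text with a freshly generated mapping."""
    return json.loads(committed_text) == generated
```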

Create the OpenSearch index in a new environment

generate-index only writes the mapping to a local file; it does not create the index in OpenSearch. A new OpenSearch instance needs the index created from that mapping before the first --sync. If documents are synced into an index that doesn't exist yet, OpenSearch silently auto-creates it with a dynamic mapping (e.g. keyword fields become text, and nested fields become plain objects). This breaks aggregations and nested queries, and the failures surface only later, far from the actual cause.

Create the index explicitly:

uv run python scripts/admin.py --env-file <env> Bigpicture create-index

This fails loudly if the index already exists, rather than silently leaving a stale mapping in place. If an index was already auto-created with the wrong mapping, OpenSearch cannot change an existing field's type in place, so the index must be deleted and recreated, and previously synced documents must be resynced:

curl -X DELETE https://<opensearch-host>:9200/bp-image-index -u <user>:<password>
uv run python scripts/admin.py --env-file <env> Bigpicture create-index
# Reset sync state so the next --sync repopulates the recreated index:
#   UPDATE document SET synced_at = NULL;
uv run python scripts/admin.py --env-file <env> Bigpicture load <dir> --load --sync
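To tell whether an existing index was auto-created with a dynamic mapping, the live mapping can be compared with the generated one. The sketch below works on plain dicts shaped like the "mappings" section of an OpenSearch index (a "properties" tree). `field_types` and `mapping_drift` are hypothetical helpers, and the field names in the usage example are invented for illustration.

```python
from typing import Dict, Optional, Tuple


def field_types(mapping: dict, prefix: str = "") -> Dict[str, str]:
    """Flatten a 'properties' tree into {dotted.path: type}."""
    flat: Dict[str, str] = {}
    for name, spec in mapping.get("properties", {}).items():
        path = f"{prefix}{name}"
        # Plain object fields carry no explicit "type" in a mapping.
        flat[path] = spec.get("type", "object")
        if "properties" in spec:
            flat.update(field_types(spec, prefix=f"{path}."))
    return flat


def mapping_drift(
    expected: dict, actual: dict
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {path: (expected_type, actual_type)} for mismatched fields."""
    want, have = field_types(expected), field_types(actual)
    return {
        path: (wanted, have.get(path))
        for path, wanted in want.items()
        if have.get(path) != wanted
    }


# Hypothetical field names, for illustration only.
expected = {"properties": {
    "sex": {"type": "keyword"},
    "stains": {"type": "nested",
               "properties": {"code": {"type": "keyword"}}},
}}
auto_created = {"properties": {
    "sex": {"type": "text"},
    "stains": {"properties": {"code": {"type": "text"}}},
}}
drift = mapping_drift(expected, auto_created)
```

An empty result means every expected field has the expected type; any entry indicates the index needs to be deleted and recreated as described above.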

LLM search

The experimental Bigpicture LLM search endpoint uses a small local Ollama model. Install and start it before running the API:

brew install ollama
ollama pull qwen2.5:14b
ollama serve

The /ai/query endpoint accepts a query for the LLM search. The LLM translates the query text into Beacon V2 filters and returns structured results.

Example:

curl -X POST "http://localhost:8000/ai/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "images for human females"}'
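The same request can be built from Python with the standard library. This is a sketch mirroring the curl call above; `build_ai_query_request` is a hypothetical helper, and the default base URL assumes the API is running locally on port 8000 as in the example.

```python
import json
import urllib.request


def build_ai_query_request(
    query: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST request for the /ai/query endpoint (nothing is sent)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ai/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ai_query_request("images for human females")
```

Passing `request` to `urllib.request.urlopen` sends it. The structure of the response is not described here, so the sketch leaves decoding to the caller.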

Performance tests

See tests/performance/README.md.

About

Sensitive data search API.
