Production-grade automation for self-hosting Qwen2.5-Coder-14B-Instruct-AWQ on a dual-GPU RTX 3060 setup using vLLM with Tensor Parallelism.
| Component | Specification |
|---|---|
| GPUs | 2× NVIDIA RTX 3060 12GB |
| Total VRAM | 24 GB |
| Parallelism | Tensor Parallel (size=2) |
| Host OS | Ubuntu 22.04 LTS |
| Model | Qwen/Qwen2.5-Coder-14B-Instruct-AWQ |
| Quantization | AWQ (4-bit activation-aware weight quantization) |
| Context | 16,384 tokens (can be raised to 32,768 on a headless GPU 0 — see Display Server Impact) |
This repository deploys a single-node vLLM OpenAI-compatible API server for use on a private homelab network.
The intended use case is:
- Run vLLM on one GPU-equipped Ubuntu host
- Expose the vLLM API on the private LAN
- Connect editor/CLI tools such as Zed, Continue (VS Code/JetBrains), or Aider
- Optionally connect a local web UI for direct chat interaction
This repo does not currently target:
- Multi-node vLLM
- Ray-based distributed inference
- Kubernetes
- Public internet exposure
- Production authentication or TLS
- Deployment Scope
- Repository Structure
- Deploying
- Manual Step-by-Step Setup (Advanced)
- API Usage
- Client & Editor Integrations
- Tuning Reference
- Server Management
- Open WebUI Support (Optional)
- Security Notes
- Development & Code Quality
- License
vllm-local-developer-stack/
├── .gitignore                           # Ignores WORKLOG.md, scripts/deploy/.env, generated override files, benchmark results
├── .pre-commit-config.yaml              # Pre-commit framework configuration
├── .secrets.baseline                    # detect-secrets baseline of known false positives
├── README.md
├── deploy-artifacts/
│   ├── docker-compose.yml               # vLLM service definition (image, GPU reservation, healthcheck)
│   ├── docker-compose.open-webui.yml    # Optional WebUI service and volume definition
│   └── docker-compose.override.yml      # Auto-generated by deploy.sh; gitignored, do not edit
├── git-hooks/
│   └── check-commit-msg-secrets.py      # commit-msg hook: scans commit messages for secrets
└── scripts/
    ├── prereqs/
    │   └── install-prereqs.sh           # Idempotent dependency installer (drivers, Docker, toolkit)
    ├── deploy/
    │   ├── deploy.sh                    # Single entry point: orchestrates the full deployment
    │   ├── .env.example                 # Annotated parameter reference; copy to .env
    │   ├── validate-system.sh           # Pre-flight hardware validation
    │   ├── validate-vram.sh             # Live startup telemetry monitor
    │   ├── setup-zed.sh                 # Zed IDE integration hook (primary supported IDE)
    │   ├── setup-continue.sh            # Continue extension hook (VS Code & JetBrains)
    │   ├── setup-aider.sh               # Aider CLI integration hook
    │   ├── smoke-test.sh                # Smoke test verification for endpoints
    │   └── teardown.sh                  # Graceful server teardown wrapper (stops and deletes containers)
    └── tuning/
        ├── tune-inference.sh            # Hardware-sensing config generator
        ├── check-bottlenecks.sh         # Hardware & OS performance advisor (--json for machine-readable output)
        ├── snapshot-diagnostics.sh      # Read-only GPU state + log capture (run before teardown.sh when debugging)
        ├── benchmark.sh                 # Single-stream token throughput evaluator
        ├── load-test.sh                 # Concurrent-request load tester (req/s, latency percentiles)
        └── compare-benchmarks.sh        # Regression detector across benchmark-results/ history
benchmark.sh and load-test.sh each write a timestamped JSON record to
benchmark-results/ (created on first run) so tuning changes can be
compared over time with compare-benchmarks.sh.
cp scripts/deploy/.env.example scripts/deploy/.env
$EDITOR scripts/deploy/.env   # set BIND_HOST and HF_CACHE_DIR at minimum

See scripts/deploy/.env.example for every available option and its explanation.

sudo bash scripts/deploy/deploy.sh

deploy.sh is the single entry point for standing up the stack. It is
not something you run after validating and tuning the system: it runs
those steps for you, in this order. Root is required end-to-end (package
installation and enabling the Docker service both need it):
| Step | What happens | Blocking? |
|---|---|---|
| 1 | Load and validate scripts/deploy/.env | ✅ Aborts if .env is missing |
| 2 | Resolve BIND_HOST (auto-detects your LAN IP if unset) | — |
| 3 | If BIND_HOST is a non-loopback address (e.g. 10.1.10.17, 192.168.0.x), check for an active firewall (ufw/firewalld) and open the API port if it isn't already allowed; if no firewall is active, or BIND_HOST is 127.0.0.1, skip (nothing to do) | — |
| 4 | Run install-prereqs.sh: drivers, Docker, nvidia-container-toolkit (idempotent; prompts before installing anything missing, never touches what's already there) | ✅ |
| 5 | Run validate-system.sh: GPU ↔ Docker connectivity, PCIe link quality | ✅ |
| 6 | Run check-bottlenecks.sh: performance advisory | |
| 7 | Run tune-inference.sh: updates only the GPU-tuned keys in scripts/deploy/.env in place, diffing against any existing values first | ✅ |
| 8 | Re-apply your user-set values on top (your settings always win over auto-tuning) | — |
| 9 | Generate docker-compose.override.yml with the fully-resolved vLLM command | — |
| 10 | Ensure docker.service is enabled at boot, so the container (restart: unless-stopped) comes back up after a host reboot, not just a plain restart | — |
| 11 | docker compose up -d | — |
| 12 | Monitor startup: VRAM telemetry + log tailing until the server reports ready, or an OOM is detected | — |
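Steps 7 and 8 in the table amount to a layered merge: auto-tuned values are written for the hardware keys, then anything the user set explicitly is applied last. The following is a minimal Python sketch of that precedence, assuming plain KEY=VALUE lines. The helper names (`parse_env`, `resolve_env`) and the `TUNED_KEYS` set are illustrative and are not taken from deploy.sh.

```python
# Hypothetical sketch of "user settings win over auto-tuning"; not deploy.sh's code.
TUNED_KEYS = {
    "TENSOR_PARALLEL_SIZE",
    "GPU_MEMORY_UTILIZATION",
    "MAX_MODEL_LEN",
    "SWAP_SPACE",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, # comments and inline comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.split(" #", 1)[0].strip()
    return env


def resolve_env(user: dict[str, str], tuned: dict[str, str]) -> dict[str, str]:
    """Take tuned values for hardware keys only, then re-apply user values on top."""
    resolved = {k: v for k, v in tuned.items() if k in TUNED_KEYS}
    resolved.update(user)  # user-set values always take precedence
    return resolved


user = parse_env("# my overrides\nBIND_HOST=10.1.10.17\nMAX_MODEL_LEN=8192  # halved\n")
tuned = {"GPU_MEMORY_UTILIZATION": "0.85", "MAX_MODEL_LEN": "16384"}
final = resolve_env(user, tuned)
print(final)
```

Here the user's MAX_MODEL_LEN survives the tuning pass, while GPU_MEMORY_UTILIZATION, which the user never set, takes the tuned value.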
A single successful run leaves you with a running, health-checked server at
http://<BIND_HOST>:<PORT>/v1. If you want to run any of these steps
individually — for debugging, or because you want finer control — see
Manual Step-by-Step Setup below.
⚠ If NVIDIA drivers were just installed by install-prereqs.sh for the first time, reboot, then re-run deploy.sh.
Benchmark — single-stream throughput, using a multi-step Rust/Tokio programming prompt:
bash scripts/tuning/benchmark.sh

Run Elapsed (s) Prompt tok Comp tok Tok/s Finish
────────────────────────────────────────────────────────────────
1 42.31 387 1024 24.20 length
2 41.87 387 1024 24.46 length
3 43.02 387 1024 23.80 length
════════════════════════════════════════════════════════════════
SUMMARY (n=3 runs)
════════════════════════════════════════════════════════════════
Avg prompt tokens 387
Avg completion tokens 1024
Avg generation time 42.40 s
Avg throughput 24.15 tok/s
Peak throughput 24.46 tok/s
Min throughput 23.80 tok/s
════════════════════════════════════════════════════════════════
Performance tier: ✓ Good (15–30 tok/s)
Results saved to: benchmark-results/benchmark_20260702T044144Z.json
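The summary figures follow directly from the per-run rows: throughput is completion tokens divided by elapsed seconds. A short Python sketch that recomputes them from the sample output above (the `summarize` helper is illustrative, not part of benchmark.sh):

```python
# Recompute the summary from (elapsed_seconds, completion_tokens) pairs.
runs = [(42.31, 1024), (41.87, 1024), (43.02, 1024)]  # rows from the sample output


def summarize(runs):
    """Return per-run tok/s plus average, peak and minimum throughput."""
    rates = [tokens / elapsed for elapsed, tokens in runs]
    return {
        "rates": [round(r, 2) for r in rates],
        "avg_time_s": round(sum(e for e, _ in runs) / len(runs), 2),
        "avg_tok_s": round(sum(rates) / len(rates), 2),
        "peak_tok_s": round(max(rates), 2),
        "min_tok_s": round(min(rates), 2),
    }


summary = summarize(runs)
print(summary)
```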
Load test (optional) — benchmark.sh measures one request at a time;
real usage (multiple editor sessions, multiple users) is concurrent.
load-test.sh fires many small requests from several simultaneous workers
and reports aggregate requests/sec, aggregate tokens/sec, and latency
percentiles (p50/p95/p99):
bash scripts/tuning/load-test.sh [concurrency] [duration_seconds] # defaults: 4, 30
bash scripts/tuning/load-test.sh 8 60 # 8 concurrent clients, 60s

It also samples nvidia-smi's per-GPU throttle-reason flags (power cap,
HW/SW thermal slowdown, power brake) once per second for the duration of
the run and reports any that go active. Unlike check-bottlenecks.sh's
point-in-time power-cap check, this catches actual throttling as it
happens under sustained concurrent load (e.g. a card that's fine at idle
but thermal-throttles a minute into real traffic).
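For readers unfamiliar with the reported metrics, here is a small Python sketch of how aggregate rates and latency percentiles can be derived from per-request samples. It uses the nearest-rank percentile definition, which is an assumption: load-test.sh may interpolate differently. The sample data is made up.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p*n/100
    return ordered[rank - 1]


def summarize_load(latencies_ms, completion_tokens, duration_s):
    """Aggregate request rate, token rate and latency percentiles for one run."""
    return {
        "req_per_s": len(latencies_ms) / duration_s,
        "tok_per_s": sum(completion_tokens) / duration_s,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Made-up sample: 20 requests completed in a 10 s window, 64 tokens each.
latencies_ms = [500 + 100 * i for i in range(20)]
stats = summarize_load(latencies_ms, [64] * 20, duration_s=10)
print(stats)
```

The gap between p50 and p99 is usually the interesting number under concurrency: a stable median with a growing tail suggests queueing rather than slower generation.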
Tracking changes over time — every benchmark.sh and load-test.sh
run is saved as a timestamped JSON record in benchmark-results/. After
changing scripts/deploy/.env (via tune-inference.sh) or host tuning (via
check-bottlenecks.sh's recommendations), re-run the same script and diff
against the previous result:
bash scripts/tuning/compare-benchmarks.sh # latest 2 benchmark.sh runs
bash scripts/tuning/compare-benchmarks.sh --load-test # latest 2 load-test.sh runs

It prints a per-metric delta table and exits non-zero if any metric regressed beyond its threshold (throughput: 10%, latency: 15–20% depending on percentile), so it is safe to drop into a personal tuning script or CI job for GPU tuning iterations.
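The threshold check can be pictured as a relative-change comparison in which the "bad" direction depends on the metric. A hedged Python sketch follows; the metric names and the assignment of 15% versus 20% to specific percentiles are assumptions made for illustration, not values read from compare-benchmarks.sh.

```python
# Illustrative thresholds; the per-percentile latency limits are assumptions.
THRESHOLDS = {
    "tok_per_s": ("higher_is_better", 0.10),
    "p50_s": ("lower_is_better", 0.15),
    "p99_s": ("lower_is_better", 0.20),
}


def find_regressions(previous, latest, thresholds=THRESHOLDS):
    """Return {metric: relative_change} for metrics that worsened beyond their limit."""
    regressions = {}
    for metric, (direction, limit) in thresholds.items():
        if metric not in previous or metric not in latest or previous[metric] == 0:
            continue
        change = (latest[metric] - previous[metric]) / previous[metric]
        worsened = -change if direction == "higher_is_better" else change
        if worsened > limit:
            regressions[metric] = round(change, 4)
    return regressions


previous = {"tok_per_s": 24.15, "p50_s": 1.40, "p99_s": 2.40}
latest = {"tok_per_s": 21.00, "p50_s": 1.50, "p99_s": 3.00}
regressions = find_regressions(previous, latest)
exit_code = 1 if regressions else 0  # non-zero signals a regression to the caller
print(regressions, exit_code)
```

In this made-up comparison throughput fell by about 13% and p99 rose by 25%, so both are flagged, while the roughly 7% rise in p50 stays under its limit.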
benchmark-results/ grows one JSON file per run and is never pruned
automatically. Once it has real history, trim it with:
bash scripts/tuning/compare-benchmarks.sh --prune # dry run — lists what would be deleted (keeps last 20 of each type)
bash scripts/tuning/compare-benchmarks.sh --prune --keep 10 # dry run with a custom retention count
bash scripts/tuning/compare-benchmarks.sh --prune --force # actually delete

Dry run is the default because deleting benchmark history is irreversible:
nothing is removed until you pass --force.
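The retention rule (keep the newest N records of each type, delete only with --force) can be sketched in a few lines of Python. This assumes file names of the form `<type>_<UTC timestamp>.json`, as in the benchmark example above; the `load-test_...` name and the injected `delete` callback are hypothetical, and the sketch never touches the disk.

```python
from collections import defaultdict


def plan_prune(filenames, keep=20):
    """Return the files a prune would delete, keeping the newest `keep` per type.

    With <type>_<UTC timestamp>.json names, a plain string sort within one
    type is also a chronological sort.
    """
    by_type = defaultdict(list)
    for name in filenames:
        by_type[name.rsplit("_", 1)[0]].append(name)
    doomed = []
    for names in by_type.values():
        ordered = sorted(names)
        doomed.extend(ordered[:-keep] if keep > 0 else ordered)
    return sorted(doomed)


def prune(filenames, keep=20, force=False, delete=lambda name: None):
    """Dry run unless force=True; `delete` is injected by the caller."""
    doomed = plan_prune(filenames, keep)
    if force:
        for name in doomed:
            delete(name)
    return doomed


files = [
    "benchmark_20260701T000000Z.json",
    "benchmark_20260702T044144Z.json",
    "benchmark_20260703T000000Z.json",
    "load-test_20260702T050000Z.json",  # hypothetical name for a load-test record
]
print(prune(files, keep=2))  # dry run: lists the oldest benchmark record only
```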
deploy.sh covers the steps below automatically. Run them individually
only if you want to re-run a single step in isolation (e.g. re-validate
after a hardware change) or need to debug a specific stage.
Step 1 — Install prerequisites
sudo bash scripts/prereqs/install-prereqs.sh

Installs NVIDIA drivers (if absent), Docker CE, nvidia-container-toolkit, and system tools. Fully idempotent and safe to re-run.

Anything already installed (driver, Docker, toolkit, NVIDIA runtime
registration in daemon.json) is left untouched: never updated, upgraded,
or reconfigured. For anything missing, it prompts [y/N] before installing
each component. Pass -y/--yes to auto-confirm every prompt for
unattended or automated runs. deploy.sh runs the installer in the
foreground, so prompts surface normally unless you pass -y.
Step 2 — Validate system hardware
bash scripts/deploy/validate-system.sh

Checks Docker ↔ GPU connectivity, PCIe link quality (Gen/Width), and display server VRAM impact on GPU 0.
| Check | Pass Condition |
|---|---|
| Docker GPU access | Container sees all GPUs via nvidia runtime |
| PCIe Gen | current == max for each GPU |
| PCIe Width | current == max for each GPU |
| Idle VRAM (GPU 0) | < 800 MiB (no display server consuming budget) |
If a display server is detected on GPU 0, switch to headless mode before deploying (saves ~600–1500 MiB on GPU 0):
sudo systemctl isolate multi-user.target

Step 3 — Generate tuned configuration

bash scripts/tuning/tune-inference.sh

Queries your GPU topology dynamically and writes a hardware-appropriate
scripts/deploy/.env. If scripts/deploy/.env doesn't exist yet, it's created from
scratch. If it already exists, only the hardware-tuned keys below are
updated in place, and any change is printed as an old -> new diff before
being applied. Everything else in the file (MODEL, BIND_HOST, PORT,
HF_CACHE_DIR, HF_TOKEN, optional feature flags, comments) is left
exactly as it is on disk.
| Parameter | Value (24 GiB setup) | Rationale |
|---|---|---|
| TENSOR_PARALLEL_SIZE | 2 | One shard per GPU |
| GPU_MEMORY_UTILIZATION | 0.85 | Headroom for CUDA overhead + a display server on GPU 0 |
| MAX_MODEL_LEN | 16384 | KV cache stays within VRAM budget (14B model; see Display Server Impact for pushing this to 32768) |
| SWAP_SPACE | 4 GiB | CPU offload buffer for burst traffic |
Review scripts/deploy/.env before continuing — you may manually adjust any
value.
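To get a feel for why MAX_MODEL_LEN drives the VRAM budget, the KV cache for one full-length sequence can be estimated by hand. The Python sketch below assumes the architecture values published for Qwen2.5-14B (48 layers, 8 key/value heads, head dimension 128) and a 2-byte (FP16/BF16) KV cache; treat all four as assumptions and check them against the model's config.json before relying on the numbers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, dtype_bytes=2):
    """Bytes of KV cache for `tokens` cached tokens (keys + values, all layers)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens


for max_model_len in (16384, 32768):
    total = kv_cache_bytes(max_model_len)
    per_gpu = total / 2  # TENSOR_PARALLEL_SIZE=2 splits the KV heads across both GPUs
    print(f"{max_model_len}: {total / GIB:.1f} GiB total, {per_gpu / GIB:.1f} GiB per GPU")
```

Under these assumptions each cached token costs 192 KiB, so one 16,384-token sequence needs about 3 GiB of KV cache and one 32,768-token sequence about 6 GiB, on top of the model weights.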
Step 4 — Start the server and monitor initialization
bash scripts/deploy/validate-vram.sh

Launches the container (docker compose up -d against the plain
docker-compose.yml, no override file) and monitors VRAM allocation in
real time during the KV cache loading phase, polling every 5s for up to 150s.
| Signal | Action |
|---|---|
| Uvicorn running on... | ✅ Exit 0: server ready |
| CUDA out of memory | ❌ Exit 1: prints recovery instructions |
| Timeout (150s) | ⚠ Model may still be downloading; check docker logs vllm-coder-server --follow |
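The signal table maps onto a simple polling loop. Below is a hedged Python sketch of that loop, using the two log markers and the 5s/150s timing quoted above. `poll_logs` and `sleep` are injected callables invented for this example so the logic can run without Docker; validate-vram.sh itself is a shell script and may be structured differently.

```python
import time

READY_MARKER = "Uvicorn running on"
OOM_MARKER = "CUDA out of memory"


def classify(line):
    """Map one log line to 'oom', 'ready', or None if it carries no signal."""
    if OOM_MARKER in line:
        return "oom"
    if READY_MARKER in line:
        return "ready"
    return None


def monitor(poll_logs, interval_s=5, timeout_s=150, sleep=time.sleep):
    """Poll for new log lines until a signal appears or the time budget runs out."""
    for _ in range(timeout_s // interval_s):
        for line in poll_logs():
            status = classify(line)
            if status:
                return status
        sleep(interval_s)
    return "timeout"


# Demo with canned log batches and a no-op sleep.
batches = iter([["Loading model weights"], ["INFO: Uvicorn running on http://0.0.0.0:8000"]])
print(monitor(lambda: next(batches, []), sleep=lambda _: None))
```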
Or bypass the monitor entirely:
docker compose -f deploy-artifacts/docker-compose.yml up -d
docker logs vllm-coder-server --follow

From here, continue with Verify performance above, or Client & Editor Integrations below.

⚠ Going this manual route skips the boot-persistence check deploy.sh does automatically. docker-compose.yml sets restart: unless-stopped, so the container itself restarts once the Docker daemon is up, but only if docker.service is enabled to start at boot:

sudo systemctl enable docker
Once the server is running, it exposes a fully OpenAI-compatible API:
# Chat Completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-coder-14b-awq",
"messages": [{"role": "user", "content": "Write a Python async HTTP client"}],
"max_tokens": 512,
"temperature": 0.1
}'
# List loaded models
curl http://localhost:8000/v1/models
# Health check
curl http://localhost:8000/health

OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM does not enforce API keys by default
)

response = client.chat.completions.create(
    model="qwen2.5-coder-14b-awq",
    messages=[{"role": "user", "content": "Implement a binary search tree in Go"}],
    max_tokens=1024,
    temperature=0.1,
)
print(response.choices[0].message.content)

These are client-side setup steps, separate from deploying, tuning, or
benchmarking the server. Each script below configures one editor/tool to
point at your running vLLM endpoint. Run it interactively (it prompts, or
reads BIND_HOST:PORT from .env if present) or with the host passed
directly as an argument. Run a given script on whichever workstation has
that tool installed; it does not need to be the machine hosting vLLM.
Zed is the primary supported editor for this stack's AI assistant integration.
Step 1 — Run the configuration script:
bash scripts/deploy/setup-zed.sh # reads host/port from .env or prompts
bash scripts/deploy/setup-zed.sh 192.168.1.50:8000 # or pass the host directly

This configures ~/.config/zed/settings.json, injecting the vLLM endpoint
as a custom OpenAI-compatible provider. It resolves the live model ID and
context window from GET /v1/models if the server is reachable, falling
back to scripts/deploy/.env's SERVED_MODEL_NAME/MAX_MODEL_LEN
otherwise. Any existing settings.json is backed up first
(settings.json.bak.<timestamp>), so it's safe to re-run whenever the
server's model or context length changes.
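The "live value with .env fallback" behaviour described above can be illustrated with a small Python sketch that parses a GET /v1/models response body. The `resolve_model` helper is hypothetical, and reading a `max_model_len` field from each model entry is an assumption about the server's response shape; the fallback arguments stand in for SERVED_MODEL_NAME and MAX_MODEL_LEN.

```python
import json


def resolve_model(models_json, fallback_name, fallback_len):
    """Pick the first served model id and context length from a /v1/models body.

    Falls back to the supplied .env values when the body is missing,
    malformed, or lists no models.
    """
    try:
        entry = json.loads(models_json)["data"][0]
        return entry["id"], int(entry.get("max_model_len") or fallback_len)
    except (TypeError, ValueError, KeyError, IndexError):
        return fallback_name, fallback_len


body = (
    '{"object": "list", "data": '
    '[{"id": "qwen2.5-coder-14b-awq", "object": "model", "max_model_len": 16384}]}'
)
print(resolve_model(body, "fallback-model", 8192))  # live values from the response
print(resolve_model(None, "fallback-model", 8192))  # server unreachable: .env values
```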
Step 2 — Set the API key placeholder:
vLLM doesn't enforce an API key, but Zed's OpenAI provider requires one to be present:
- In Zed, open the Agent panel and click the model selector in the bottom-right corner → Configure... (or run the agent: settings command via Ctrl+Shift+A).
- Under the OpenAI provider section, enter dummy as the API key.
| Setting | Value |
|---|---|
| Provider | openai (custom endpoint via api_url) |
| API Base | http://<host>:<port>/v1 |
| Model | Resolved live from GET /v1/models if reachable (typically qwen2.5-coder-14b-awq) — falls back to .env's SERVED_MODEL_NAME |
| API Key | Not enforced by vLLM — use dummy in Zed's provider settings |
Install the Continue extension/plugin (VS Code Marketplace or JetBrains Marketplace, for IDEs like PyCharm/IntelliJ/WebStorm/CLion), then run:
bash scripts/deploy/setup-continue.sh
# or: bash scripts/deploy/setup-continue.sh 192.168.1.50:8000

- VS Code: run the script, then reload the window (Ctrl+Shift+P → Developer: Reload Window).
- JetBrains IDEs (PyCharm, IntelliJ IDEA, WebStorm, CLion, etc.): run the script (Continue in JetBrains shares the same ~/.continue/config.yaml global path on Linux/macOS), then restart the IDE or click the gear icon in the Continue sidebar to refresh.
The setup script injects the vLLM endpoint into ~/.continue/config.yaml on the machine you run it on, setting it as both the chat model and the tab-autocomplete model. It will create the file with full defaults if it doesn't exist, or patch it safely (with a backup) if it does.
| Setting | Value |
|---|---|
| Provider | openai (OpenAI-compatible) |
| API Base | http://<host>:<port>/v1 |
| Model | Resolved live from GET /v1/models if the server is reachable (typically qwen2.5-coder-14b-awq, matching --served-model-name) — falls back to MODEL= from scripts/deploy/.env with a warning if it isn't (only relevant when run on the vLLM host itself, since that's the only place scripts/deploy/.env exists) |
| Autocomplete | Same resolved model, max_tokens=512, temperature=0.05 |
If the server isn't reachable yet when you run this from a remote workstation, it falls back to the default HuggingFace model ID — re-run it once the server is up for an accurate config.
You can also use Aider as a command-line coding assistant powered by the vLLM instance. A setup script is provided to automate Aider installation and configuration.
bash scripts/deploy/setup-aider.sh
# or: bash scripts/deploy/setup-aider.sh 192.168.1.50:8000

This script:
- Detects if Aider is installed. If it is not, it stops to confirm if you want to install it (supporting installation via pipx or pip).
- Resolves the vLLM server address (interactively prompting for IP and port, reading from scripts/deploy/.env, or using the command-line argument).
- Safely updates or creates Aider configuration files (.aider.conf.yml at the project root or ~/.aider.conf.yml in your home directory) to use the local vLLM endpoint, specifically patching only the OpenAI-compatible API base URL, API key, and model parameters.
- Generates or updates .aider.model.metadata.json alongside your Aider config to register the correct context window size (based on the server's MAX_MODEL_LEN) and token cost structures, suppressing any "Unknown context window size and costs" warnings.
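As an illustration of the last step, a metadata entry for a local model might be built and merged as in the Python sketch below. The field names follow the layout Aider documents for .aider.model.metadata.json; the `openai/` model prefix, the 4096-token output cap, and both helper names are assumptions for this example and may not match what setup-aider.sh writes.

```python
import json


def aider_metadata(model_id, max_model_len, max_output_tokens=4096):
    """Build one metadata entry for a locally served, zero-cost model."""
    return {
        f"openai/{model_id}": {
            "max_tokens": max_output_tokens,
            "max_input_tokens": max_model_len,  # taken from the server's MAX_MODEL_LEN
            "max_output_tokens": max_output_tokens,
            "input_cost_per_token": 0.0,
            "output_cost_per_token": 0.0,
            "litellm_provider": "openai",
            "mode": "chat",
        }
    }


def merge_metadata(existing_text, entry):
    """Merge `entry` into existing JSON text, leaving other models untouched."""
    data = json.loads(existing_text) if existing_text.strip() else {}
    data.update(entry)
    return json.dumps(data, indent=2)


existing = '{"other/model": {"mode": "chat"}}'
merged_text = merge_metadata(existing, aider_metadata("qwen2.5-coder-14b-awq", 16384))
print(merged_text)
```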
Once configured, simply run:
aider

If vLLM exits with a CUDA OOM error during initialization:
# Edit scripts/deploy/.env
MAX_MODEL_LEN=8192 # Halve the context window
GPU_MEMORY_UTILIZATION=0.80 # Lower than the 0.85 default to increase headroom
# Restart
docker compose -f deploy-artifacts/docker-compose.yml down
sudo bash scripts/deploy/deploy.sh

The RTX 3060 does not support NVLink. All inter-GPU communication for Tensor Parallelism goes over PCIe. A secondary slot running at x4 instead of x16 will reduce NCCL all-reduce bandwidth and may increase latency by 10–25% on large token batches.
To diagnose: run bash scripts/deploy/validate-system.sh and review the
PCIe table (flags a GPU running below its own rated Gen/Width spec), or
bash scripts/tuning/check-bottlenecks.sh for the more directly relevant
check — it computes actual effective GB/s per link and flags anything
below the Gen3×8 / Gen4×4 floor that Tensor Parallelism's NCCL all-reduce
needs, which a card can fail even while running at its own full rated spec
(e.g. a Gen2×16 slot).
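For orientation, the theoretical one-direction bandwidth of a PCIe link follows from the per-lane transfer rate, the line encoding, and the lane count. The Python sketch below uses the standard figures for Gen1 through Gen4 and treats Gen3 x8 as the floor named above. It is a back-of-the-envelope calculation, not the measurement check-bottlenecks.sh performs, and `link_gbps` and `below_floor` are illustrative names.

```python
# Per-lane raw rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}


def link_gbps(gen, width):
    """Theoretical one-direction bandwidth in GB/s for a Gen<gen> x<width> link."""
    rate, efficiency = PCIE[gen]
    return rate * efficiency * width / 8  # bits to bytes


FLOOR = link_gbps(3, 8)  # Gen3 x8; Gen4 x4 works out to the same figure


def below_floor(gen, width):
    """True when the link's theoretical bandwidth is under the Gen3 x8 floor."""
    return link_gbps(gen, width) < FLOOR - 1e-9


for gen, width in [(4, 16), (3, 8), (4, 4), (3, 4)]:
    flag = "below floor" if below_floor(gen, width) else "ok"
    print(f"Gen{gen} x{width}: {link_gbps(gen, width):.2f} GB/s ({flag})")
```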
Even with the smaller 14B model, a desktop environment competing for GPU
0's VRAM can push KV cache allocation into an OOM at boot — this is why the
shipped defaults use GPU_MEMORY_UTILIZATION=0.85 and MAX_MODEL_LEN=16384
rather than the more aggressive 0.90/32768 the hardware can otherwise
support. If GPU 0 shows >800 MiB idle usage, free it before deploying:
# Free GPU 0 before deployment (non-destructive, re-enable with graphical.target)
sudo systemctl isolate multi-user.target
# Re-enable desktop when done
sudo systemctl isolate graphical.target

If GPU 0 is already headless (near-zero idle VRAM usage), you can raise
GPU_MEMORY_UTILIZATION back to 0.90 and MAX_MODEL_LEN to 32768 in
scripts/deploy/.env for the full context window.
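The decision above can be sketched in Python as a check on GPU 0's idle memory. The input is the text produced by `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`; using the 800 MiB figure as the single cutoff between the conservative and aggressive settings is a simplification made for this example, and `suggest_settings` is a hypothetical helper.

```python
IDLE_LIMIT_MIB = 800  # idle-usage threshold quoted in this section


def parse_memory_used(csv_text):
    """Map GPU index -> used MiB from 'index, memory.used' CSV rows."""
    usage = {}
    for row in csv_text.strip().splitlines():
        index, used = (field.strip() for field in row.split(","))
        usage[int(index)] = int(used)
    return usage


def suggest_settings(csv_text):
    """Keep the shipped defaults unless GPU 0 is effectively headless."""
    gpu0_used = parse_memory_used(csv_text)[0]
    if gpu0_used > IDLE_LIMIT_MIB:
        return {"GPU_MEMORY_UTILIZATION": "0.85", "MAX_MODEL_LEN": "16384", "free_gpu0_first": True}
    return {"GPU_MEMORY_UTILIZATION": "0.90", "MAX_MODEL_LEN": "32768", "free_gpu0_first": False}


print(suggest_settings("0, 912\n1, 3\n"))  # desktop session holding VRAM on GPU 0
print(suggest_settings("0, 5\n1, 3\n"))    # headless host
```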
# Stop the server
docker compose -f deploy-artifacts/docker-compose.yml down
# or: bash scripts/deploy/teardown.sh (same thing, plus a post-stop VRAM confirmation table)
# Capture a diagnostic snapshot before stopping (GPU state + recent logs) —
# useful when debugging a crash, OOM, or silent hang. Read-only, never
# modifies GPU/container state.
bash scripts/tuning/snapshot-diagnostics.sh [log_lines] # default: last 50 log lines
# Saved to ~/.local/share/vllm-snapshots/snapshot_<timestamp>.txt
# View live logs
docker compose -f deploy-artifacts/docker-compose.yml logs -f
# Restart after a config change
docker compose -f deploy-artifacts/docker-compose.yml down
sudo bash scripts/deploy/deploy.sh # Re-tunes and regenerates the override file
# Check container health
docker inspect --format='{{.State.Health.Status}}' vllm-coder-server

This repository includes optional support for Open WebUI, allowing you to interact with the hosted vLLM model through a ChatGPT-like browser interface.
- Browser Access: Use the model from any device (phone, laptop, tablet) on your local network.
- Optional & Disabled by Default: Kept disabled by default (ENABLE_OPEN_WEBUI=false) to preserve the core vLLM-only focus.
- Data Persistence: Open WebUI's database, user accounts, and chat history are saved in a persistent Docker volume, preserving your data across container restarts, redeployments, and normal teardown.sh operations.
- Auto-Boot: Starts automatically at host reboot alongside vLLM when enabled.
All Open WebUI settings live in your scripts/deploy/.env file. Copy these values from scripts/deploy/.env.example if they are not already in your configuration:
# Enable Open WebUI deployment (true/false)
ENABLE_OPEN_WEBUI=true
# Port on the host network where Open WebUI will listen
OPEN_WEBUI_PORT=3000
# Subnet CIDR of your private LAN to restrict firewall access (optional)
# Example: LAN_CIDR=10.1.10.0/24
LAN_CIDR=10.1.10.0/24

Set ENABLE_OPEN_WEBUI=true in scripts/deploy/.env and run the deployment script:

sudo bash scripts/deploy/deploy.sh

The script will automatically detect that Open WebUI is enabled, perform port availability checks, open UFW/firewalld rules (restricted to LAN_CIDR if set), launch the container stack, and validate that both vLLM and Open WebUI are running and configured with their restart policies.
After successful deployment, the script outputs the connection URLs:
vLLM API : http://10.1.10.17:8000/v1
Open WebUI : http://10.1.10.17:3000
Note: On the first run of Open WebUI, you will need to sign up to create the admin account. This account is entirely local and does not send any data outside your network. Since this setup is intended for a trusted home LAN, there is no TLS or external authentication configured by default.
To verify both services are running and accessible from either the host itself or another LAN client:
# Run the validation smoke test
bash scripts/deploy/smoke-test.sh

You can also pass overrides to verify connectivity from another machine on your LAN:
# Usage: bash scripts/deploy/smoke-test.sh [host-ip] [vllm-port] [open-webui-port] [enable-webui]
bash scripts/deploy/smoke-test.sh 10.1.10.17 8000 3000 true

The smoke test validates:
- vLLM /v1/models endpoint responds successfully.
- Open WebUI HTTP endpoint responds on the configured port.
- Both containers are configured with unless-stopped (or your configured) restart policies.
To stop the services and release GPU VRAM:
bash scripts/deploy/teardown.sh

This stops both vLLM and Open WebUI containers. Your Open WebUI chat history, user accounts, and settings are preserved.
To perform a deep-clean and delete all Open WebUI data/volumes, pass the --purge flag:
# WARNING: This deletes the Open WebUI database/volume permanently!
bash scripts/deploy/teardown.sh --purge

- Container fails to start or port already in use: The deployment script checks port availability and fails fast. If the port is in use, verify with ss -tlnp (as root) or configure a different OPEN_WEBUI_PORT in scripts/deploy/.env.
- Cannot reach Open WebUI from another LAN host: Ensure OPEN_WEBUI_HOST is set to 0.0.0.0 (all interfaces) in scripts/deploy/.env. Verify that the firewall (UFW/firewalld) is allowing the port and that LAN_CIDR matches your client's subnet.
- Open WebUI cannot reach vLLM: Open WebUI connects to vLLM inside the Docker network. Ensure OPEN_WEBUI_OPENAI_API_BASE_URL in scripts/deploy/.env points to http://vllm:8000/v1 (using the container service name vllm rather than localhost).
- Services not starting after reboot: Check if the Docker service is enabled to start at boot (systemctl is-enabled docker). Verify the restart policies in scripts/deploy/.env (VLLM_RESTART_POLICY and OPEN_WEBUI_RESTART_POLICY) are set to unless-stopped or always.
- The server binds to 0.0.0.0 (all interfaces) by default. If running on a network-accessible machine, set BIND_HOST=127.0.0.1 in scripts/deploy/.env to restrict it to localhost, or add a firewall rule.
- If BIND_HOST is set to a LAN address, deploy.sh opens the API port on ufw/firewalld automatically (only if one of them is active; it never installs or enables a firewall for you). It only ever opens the single PORT from scripts/deploy/.env, never a broad range.
- vLLM does not enforce API key authentication by default. Add --api-key <secret> to the command in docker-compose.yml (or via docker-compose.override.yml) to enable it.
- The HuggingFace cache is mounted from the host via HF_CACHE_DIR. Ensure the model cache directory has appropriate permissions.
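When reviewing a BIND_HOST value, it can help to classify the address before deciding whether a firewall rule matters. The Python sketch below uses the standard-library ipaddress module; the `exposure` helper and its category labels are illustrative and are not part of deploy.sh. It expects a literal IP address, so hostnames raise ValueError.

```python
import ipaddress


def exposure(bind_host):
    """Classify a literal bind address by how widely it exposes the API."""
    addr = ipaddress.ip_address(bind_host)
    if addr.is_loopback:
        return "loopback"        # reachable only from the host itself
    if addr.is_unspecified:
        return "all-interfaces"  # 0.0.0.0: every interface on the host
    return "lan" if addr.is_private else "public"


for host in ("127.0.0.1", "0.0.0.0", "10.1.10.17", "8.8.8.8"):
    print(host, "->", exposure(host))
```

Anything other than "loopback" means other machines can reach the unauthenticated API, which is when the firewall and --api-key notes above become relevant.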
This repository uses pre-commit to automate code validation and enforce security best practices before any changes are committed.
Syntax linting & formatting
- trailing-whitespace: trims trailing whitespace from files
- end-of-file-fixer: ensures files end with a newline
- check-yaml: validates YAML syntax (e.g. deploy-artifacts/docker-compose.yml, .pre-commit-config.yaml)
- check-json: validates JSON syntax
- check-added-large-files: blocks accidentally committing large files (e.g. model weights, cached tensors)
- shellcheck: runs ShellCheck on all shell scripts in scripts/
Security & secret detection
- detect-private-key: checks for the presence of private keys
- detect-secrets: scans staged changes for hardcoded secrets, API keys, or credentials using detect-secrets (no account/registration required, unlike some hosted secret-scanning services). Known false positives (e.g. the literal placeholder api_key="dummy" used since vLLM doesn't enforce API keys by default) are tracked in .secrets.baseline. If you intentionally add a new one, regenerate it with detect-secrets scan > .secrets.baseline and mark it as a false positive.
- Custom commit message scanner: git-hooks/check-commit-msg-secrets.py scans Git commit messages for secrets (e.g. AWS keys, Slack tokens, high-entropy API keys) during the commit-msg hook phase
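To make the commit-message scan concrete, here is a hedged Python sketch of the general technique: a few known-format patterns plus a Shannon-entropy check on long tokens. The regular expressions, the 4.0 bits-per-character limit, and the 24-character minimum are assumptions chosen for illustration; they are not copied from git-hooks/check-commit-msg-secrets.py.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; real scanners cover many more formats.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "slack-token": re.compile(r"\bxox[abprs]-[0-9A-Za-z-]{10,}\b"),
}
LONG_TOKEN = re.compile(r"[A-Za-z0-9+/_=-]{24,}")


def shannon_entropy(text):
    """Bits per character of the empirical character distribution."""
    counts = Counter(text)
    return -sum(n / len(text) * math.log2(n / len(text)) for n in counts.values())


def scan_message(message, entropy_limit=4.0):
    """Return the names of every check the commit message trips."""
    findings = [name for name, rx in PATTERNS.items() if rx.search(message)]
    if any(shannon_entropy(tok) >= entropy_limit for tok in LONG_TOKEN.findall(message)):
        findings.append("high-entropy-string")
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
print(scan_message("fix: rotate AKIAIOSFODNN7EXAMPLE from staging"))
print(scan_message("fix: tighten PCIe bandwidth check in check-bottlenecks.sh"))
```

A commit-msg hook built on this would read the message file Git passes as its first argument and exit non-zero when the findings list is not empty.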
Run every pre-commit check against all files at any time:
pre-commit run --all-files

MIT