vllm-local-developer-stack

Production-grade automation for self-hosting Qwen2.5-Coder-14B-Instruct-AWQ on a dual-GPU RTX 3060 setup using vLLM with Tensor Parallelism.

Component	Specification
GPUs	2× NVIDIA RTX 3060 12GB
Total VRAM	24 GB
Parallelism	Tensor Parallel (size=2)
Host OS	Ubuntu 22.04 LTS
Model	`Qwen/Qwen2.5-Coder-14B-Instruct-AWQ`
Quantization	AWQ (4-bit, activation-aware weight quantization)
Context	16,384 tokens (can be raised to 32,768 on a headless GPU 0 — see Display Server Impact)

Deployment Scope

This repository deploys a single-node vLLM OpenAI-compatible API server for use on a private homelab network.

The intended use case is:

Run vLLM on one GPU-equipped Ubuntu host
Expose the vLLM API on the private LAN
Connect editor/CLI tools such as Zed, Continue (VS Code/JetBrains), or Aider
Optionally connect a local web UI for direct chat interaction

This repo does not currently target:

Multi-node vLLM
Ray-based distributed inference
Kubernetes
Public internet exposure
Production authentication or TLS

Repository Structure

vllm-local-developer-stack/
├── .gitignore                          # Ignores WORKLOG.md, scripts/deploy/.env, generated override files, benchmark results
├── .pre-commit-config.yaml             # Pre-commit framework configuration
├── .secrets.baseline                   # detect-secrets baseline — known false positives
├── README.md
├── deploy-artifacts/
│   ├── docker-compose.yml              # vLLM service definition (image, GPU reservation, healthcheck)
│   ├── docker-compose.open-webui.yml  # Optional WebUI service and volume definition
│   └── docker-compose.override.yml     # Auto-generated by deploy.sh — gitignored, do not edit
├── git-hooks/
│   └── check-commit-msg-secrets.py     # commit-msg hook: scans commit messages for secrets
└── scripts/
    ├── prereqs/
    │   └── install-prereqs.sh          # Idempotent dependency installer (drivers, Docker, toolkit)
    ├── deploy/
    │   ├── deploy.sh                   # Single entry point — orchestrates the full deployment
    │   ├── .env.example                # Annotated parameter reference — copy to .env
    │   ├── validate-system.sh          # Pre-flight hardware validation
    │   ├── validate-vram.sh            # Live startup telemetry monitor
    │   ├── setup-zed.sh                # Zed IDE integration hook (primary supported IDE)
    │   ├── setup-continue.sh           # Continue extension hook (VS Code & JetBrains)
    │   ├── setup-aider.sh              # Aider CLI integration hook
    │   ├── smoke-test.sh               # Smoke test verification for endpoints
    │   └── teardown.sh                 # Graceful server teardown wrapper (stops and deletes containers)
    └── tuning/
        ├── tune-inference.sh           # Hardware-sensing config generator
        ├── check-bottlenecks.sh        # Hardware & OS performance advisor (--json for machine-readable output)
        ├── snapshot-diagnostics.sh     # Read-only GPU state + log capture (run before teardown.sh when debugging)
        ├── benchmark.sh                # Single-stream token throughput evaluator
        ├── load-test.sh                # Concurrent-request load tester (req/s, latency percentiles)
        └── compare-benchmarks.sh       # Regression detector across benchmark-results/ history

benchmark.sh and load-test.sh each write a timestamped JSON record to benchmark-results/ (created on first run) so tuning changes can be compared over time with compare-benchmarks.sh.

Deploying

1. Configure `scripts/deploy/.env`

cp scripts/deploy/.env.example scripts/deploy/.env
$EDITOR scripts/deploy/.env          # set BIND_HOST, HF_CACHE_DIR at minimum

See scripts/deploy/.env.example for every available option and its explanation.

2. Run the deploy script

sudo bash scripts/deploy/deploy.sh

deploy.sh is the single entry point for standing up the stack. It is not something you run after validating and tuning the system — it runs those steps for you, in this order. Root is required end-to-end (package installation and enabling the Docker service both need it):

Step	What happens	Blocking?
1	Load and validate `scripts/deploy/.env`	✅ Aborts if `.env` is missing
2	Resolve `BIND_HOST` (auto-detects your LAN IP if unset)	—
3	If `BIND_HOST` is a non-loopback address (e.g. `10.1.10.17`, `192.168.0.x`), check for an active firewall (`ufw`/`firewalld`) and open the API port if it isn't already allowed; if no firewall is active, or `BIND_HOST` is `127.0.0.1`, skip — nothing to do	—
4	Run `install-prereqs.sh` — drivers, Docker, nvidia-container-toolkit (idempotent; prompts before installing anything missing, never touches what's already there)	✅
5	Run `validate-system.sh` — GPU ↔ Docker connectivity, PCIe link quality	✅
6	Run `check-bottlenecks.sh` — performance advisory	⚠️ Never blocks
7	Run `tune-inference.sh` — updates only the GPU-tuned keys in `scripts/deploy/.env` in place, diffing against any existing values first	✅
8	Re-apply your user-set values on top (your settings always win over auto-tuning)	—
9	Generate `docker-compose.override.yml` with the fully-resolved vLLM command	—
10	Ensure `docker.service` is enabled at boot, so the container (`restart: unless-stopped`) comes back up after a host reboot, not just a plain restart	—
11	`docker compose up -d`	—
12	Monitor startup — VRAM telemetry + log tailing until the server reports ready, or an OOM is detected	—

A single successful run leaves you with a running, health-checked server at http://<BIND_HOST>:<PORT>/v1. If you want to run any of these steps individually — for debugging, or because you want finer control — see Manual Step-by-Step Setup below.
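The loopback decision in step 3 can be sketched in a few lines. This is an illustration rather than deploy.sh's actual logic (the script itself is shell); `needs_firewall_rule` and its `firewall_active` argument are hypothetical names, and the sketch assumes `BIND_HOST` is a literal IP address rather than a hostname.

```python
import ipaddress


def needs_firewall_rule(bind_host, firewall_active):
    """True when the API port must be opened: non-loopback bind and an active firewall."""
    # ip_address() raises ValueError for hostnames such as "localhost".
    return firewall_active and not ipaddress.ip_address(bind_host).is_loopback


print(needs_firewall_rule("10.1.10.17", True))
print(needs_firewall_rule("127.0.0.1", True))
```

Keeping the check to "non-loopback and firewall active" mirrors the table: with no active firewall, or with a loopback bind, there is nothing to open.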

⚠ If NVIDIA drivers were just installed by install-prereqs.sh for the first time, reboot, then re-run deploy.sh.

3. Verify performance

Benchmark — single-stream throughput, using a multi-step Rust/Tokio programming prompt:

bash scripts/tuning/benchmark.sh

  Run  Elapsed (s)   Prompt tok    Comp tok    Tok/s      Finish
  ────────────────────────────────────────────────────────────────
  1    42.31          387           1024        24.20      length
  2    41.87          387           1024        24.46      length
  3    43.02          387           1024        23.80      length

  ════════════════════════════════════════════════════════════════
  SUMMARY (n=3 runs)
  ════════════════════════════════════════════════════════════════
  Avg prompt tokens                   387
  Avg completion tokens               1024
  Avg generation time                 42.40 s
  Avg throughput                      24.15 tok/s
  Peak throughput                     24.46 tok/s
  Min throughput                      23.80 tok/s
  ════════════════════════════════════════════════════════════════
  Performance tier: ✓ Good (15–30 tok/s)

  Results saved to: benchmark-results/benchmark_20260702T044144Z.json
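The summary figures in the sample output follow directly from the per-run rows: throughput is completion tokens divided by elapsed seconds, then averaged across runs. A minimal sketch of that arithmetic is below; it is not benchmark.sh's code, and the `summarize` function and its dictionary keys are hypothetical.

```python
from statistics import mean

# Per-run (elapsed_seconds, completion_tokens) pairs copied from the sample output.
RUNS = [(42.31, 1024), (41.87, 1024), (43.02, 1024)]


def summarize(runs):
    """Return avg/peak/min throughput in tok/s plus the average generation time."""
    rates = [tokens / elapsed for elapsed, tokens in runs]
    return {
        "avg_time_s": round(mean(e for e, _ in runs), 2),
        "avg_tok_s": round(mean(rates), 2),
        "peak_tok_s": round(max(rates), 2),
        "min_tok_s": round(min(rates), 2),
    }


print(summarize(RUNS))
```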

Load test (optional) — benchmark.sh measures one request at a time; real usage (multiple editor sessions, multiple users) is concurrent. load-test.sh fires many small requests from several simultaneous workers and reports aggregate requests/sec, aggregate tokens/sec, and latency percentiles (p50/p95/p99):

bash scripts/tuning/load-test.sh [concurrency] [duration_seconds]   # defaults: 4, 30
bash scripts/tuning/load-test.sh 8 60                                # 8 concurrent clients, 60s

It also samples nvidia-smi's per-GPU throttle-reason flags (power cap, HW/SW thermal slowdown, power brake) once per second for the duration of the run and reports any that go active — distinct from check-bottlenecks.sh's point-in-time power-cap check, this catches actual throttling as it happens under sustained concurrent load (e.g. a card that's fine at idle but thermal-throttles a minute into real traffic).
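For readers unfamiliar with latency percentiles, the sketch below shows one common way to derive p50/p95/p99 from a list of per-request latencies. The nearest-rank method used here is an assumption; load-test.sh may interpolate or use a different definition, and `percentile` and `latency_report` are hypothetical names.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def latency_report(latencies_s, window_s):
    """Aggregate req/s and p50/p95/p99 for one load-test window."""
    return {
        "req_per_s": len(latencies_s) / window_s,
        "p50": percentile(latencies_s, 50),
        "p95": percentile(latencies_s, 95),
        "p99": percentile(latencies_s, 99),
    }
```

With only a handful of requests, p95 and p99 collapse onto the slowest sample, which is one reason to run the load test for longer than the default window when comparing tail latency.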

Tracking changes over time — every benchmark.sh and load-test.sh run is saved as a timestamped JSON record in benchmark-results/. After changing scripts/deploy/.env (via tune-inference.sh) or host tuning (via check-bottlenecks.sh's recommendations), re-run the same script and diff against the previous result:

bash scripts/tuning/compare-benchmarks.sh              # latest 2 benchmark.sh runs
bash scripts/tuning/compare-benchmarks.sh --load-test   # latest 2 load-test.sh runs

It prints a per-metric delta table and exits non-zero if any metric regressed beyond its threshold (throughput: 10%, latency: 15–20% depending on percentile) — safe to drop into a personal tuning script or CI job for GPU tuning iterations.
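The threshold check can be pictured roughly as follows. This is a sketch under assumptions, not compare-benchmarks.sh itself: the metric names are hypothetical, and while the 10% throughput threshold comes from the text above, the specific 15% / 20% split between latency percentiles is a guess within the stated range.

```python
def regressions(previous, current, thresholds):
    """Return {metric: percent_change} for metrics that got worse beyond their threshold.

    thresholds maps metric -> (max_worsening_pct, higher_is_better).
    """
    flagged = {}
    for metric, (limit_pct, higher_is_better) in thresholds.items():
        old, new = previous[metric], current[metric]
        change_pct = (new - old) / old * 100
        worsening = -change_pct if higher_is_better else change_pct
        if worsening > limit_pct:
            flagged[metric] = round(change_pct, 1)
    return flagged


# Assumed metric names and latency thresholds (see the lead-in).
THRESHOLDS = {
    "avg_tok_s": (10, True),
    "p50_s": (15, False),
    "p99_s": (20, False),
}
```

A wrapper would then exit non-zero whenever the returned dictionary is non-empty, which is what makes the comparison usable as a gate in a tuning loop.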

benchmark-results/ grows one JSON file per run and is never pruned automatically. Once it has real history, trim it with:

bash scripts/tuning/compare-benchmarks.sh --prune              # dry run — lists what would be deleted (keeps last 20 of each type)
bash scripts/tuning/compare-benchmarks.sh --prune --keep 10     # dry run with a custom retention count
bash scripts/tuning/compare-benchmarks.sh --prune --force       # actually delete

Dry run is the default since deleting benchmark history is irreversible — nothing is removed until you pass --force.
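The retention rule ("keep the last N of each type") can be sketched as a pure function that only reports what a prune would delete, which matches the dry-run default. The `load-test_` filename prefix below is an assumption (only the `benchmark_<timestamp>.json` name appears in this README), and `select_prunable` is a hypothetical helper.

```python
from collections import defaultdict


def select_prunable(filenames, keep=20):
    """Return the files a prune would delete: everything but the newest `keep` per type."""
    by_type = defaultdict(list)
    for name in filenames:
        # "benchmark_20260702T044144Z.json" -> type "benchmark"
        by_type[name.rsplit("_", 1)[0]].append(name)
    doomed = []
    for names in by_type.values():
        # Timestamps in this format sort chronologically as plain strings.
        names.sort()
        doomed.extend(names[:-keep] if keep > 0 else names)
    return sorted(doomed)
```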

Manual Step-by-Step Setup (Advanced)

deploy.sh covers the steps below automatically. Run them individually only if you want to re-run a single step in isolation (e.g. re-validate after a hardware change) or need to debug a specific stage.

Step 1 — Install prerequisites

sudo bash scripts/prereqs/install-prereqs.sh

Installs NVIDIA drivers (if absent), Docker CE, nvidia-container-toolkit, and system tools. Fully idempotent — safe to re-run.

Anything already installed (driver, Docker, toolkit, NVIDIA runtime registration in daemon.json) is left untouched: it is never updated, upgraded, or reconfigured. For anything missing, the script prompts [y/N] before installing each component. Pass -y/--yes to auto-confirm every prompt for unattended runs. deploy.sh runs the installer in the foreground, so its prompts surface normally unless you pass -y.

Step 2 — Validate system hardware

bash scripts/deploy/validate-system.sh

Checks Docker ↔ GPU connectivity, PCIe link quality (Gen/Width), and display server VRAM impact on GPU 0.

Check	Pass Condition
Docker GPU access	Container sees all GPUs via `nvidia` runtime
PCIe Gen	`current` == `max` for each GPU
PCIe Width	`current` == `max` for each GPU
Idle VRAM (GPU 0)	< 800 MiB (no display server consuming budget)

If a display server is detected on GPU 0, switch to headless mode before deploying (saves ~600–1500 MiB on GPU 0):

sudo systemctl isolate multi-user.target

Step 3 — Generate tuned configuration

bash scripts/tuning/tune-inference.sh

Queries your GPU topology dynamically and writes a hardware-appropriate scripts/deploy/.env. If scripts/deploy/.env doesn't exist yet, it's created from scratch. If it already exists, only the hardware-tuned keys below are updated in place — any change is printed as an old -> new diff before being applied. Everything else in the file (MODEL, BIND_HOST, PORT, HF_CACHE_DIR, HF_TOKEN, optional feature flags, comments) is left exactly as it is on disk.

Parameter	Value (24 GiB setup)	Rationale
`TENSOR_PARALLEL_SIZE`	2	One shard per GPU
`GPU_MEMORY_UTILIZATION`	0.85	Headroom for CUDA overhead + a display server on GPU 0
`MAX_MODEL_LEN`	16384	KV cache stays within VRAM budget (14B model; see Display Server Impact for pushing this to 32768)
`SWAP_SPACE`	4 GiB	CPU offload buffer for burst traffic

Review scripts/deploy/.env before continuing — you may manually adjust any value.
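The "update only the tuned keys, leave everything else alone" behaviour described above can be illustrated with a small text transform. This is a sketch, not tune-inference.sh's implementation (which is a shell script); `update_env_text` is a hypothetical name, and the sketch assumes plain `KEY=VALUE` lines without inline comments.

```python
def update_env_text(text, tuned):
    """Apply tuned KEY=VALUE pairs to .env text; return (new_text, diff_lines)."""
    remaining = dict(tuned)
    out, diffs = [], []
    for line in text.splitlines():
        key, sep, old = line.partition("=")
        if sep and key in remaining:
            new = str(remaining.pop(key))
            if new != old:
                diffs.append(f"{key}: {old} -> {new}")
            line = f"{key}={new}"
        # Comments and unrelated keys pass through unchanged.
        out.append(line)
    # Keys not present yet are appended at the end.
    for key, new in remaining.items():
        diffs.append(f"{key}: (unset) -> {new}")
        out.append(f"{key}={new}")
    return "\n".join(out) + "\n", diffs
```

Returning the diff lines separately makes it possible to print the old -> new changes before anything is written to disk.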

Step 4 — Start the server and monitor initialization

bash scripts/deploy/validate-vram.sh

Launches the container (docker compose up -d against the plain docker-compose.yml, no override file) and monitors VRAM allocation in real time while the model loads and the KV cache is allocated, polling every 5s for up to 150s.

Signal	Action
`Uvicorn running on...`	✅ Exit 0 — server ready
`CUDA out of memory`	❌ Exit 1 — prints recovery instructions
Timeout (150s)	⚠ Model may still be downloading; check `docker logs vllm-coder-server --follow`
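The signal table amounts to a small decision function over the container's log output. The sketch below shows that mapping only; it is not validate-vram.sh, and `startup_status` is a hypothetical name. The two marker strings and the 150s timeout are taken from the table.

```python
READY_MARKER = "Uvicorn running on"
OOM_MARKER = "CUDA out of memory"


def startup_status(log_text, elapsed_s, timeout_s=150):
    """Map container log output to the monitor's outcomes: ready, oom, timeout, or waiting."""
    if OOM_MARKER in log_text:
        return "oom"        # exit 1 in the table
    if READY_MARKER in log_text:
        return "ready"      # exit 0
    return "timeout" if elapsed_s >= timeout_s else "waiting"
```

A polling loop would call this every 5 seconds with the latest output of `docker logs vllm-coder-server` and stop on anything other than "waiting".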

Or bypass the monitor entirely:

docker compose -f deploy-artifacts/docker-compose.yml up -d
docker logs vllm-coder-server --follow

From here, continue with Verify performance above, or Client & Editor Integrations below.

⚠ Going this manual route skips the boot-persistence check deploy.sh does automatically. docker-compose.yml sets restart: unless-stopped, so the container itself restarts once the Docker daemon is up — but only if docker.service is enabled to start at boot:
sudo systemctl enable docker

API Usage

Once the server is running, it exposes a fully OpenAI-compatible API:

# Chat Completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-14b-awq",
    "messages": [{"role": "user", "content": "Write a Python async HTTP client"}],
    "max_tokens": 512,
    "temperature": 0.1
  }'

# List loaded models
curl http://localhost:8000/v1/models

# Health check
curl http://localhost:8000/health

OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM does not enforce API keys by default
)

response = client.chat.completions.create(
    model="qwen2.5-coder-14b-awq",
    messages=[{"role": "user", "content": "Implement a binary search tree in Go"}],
    max_tokens=1024,
    temperature=0.1,
)
print(response.choices[0].message.content)

Client & Editor Integrations

These are client-side setup steps, separate from deploying, tuning, or benchmarking the server. Each script below configures one editor/tool to point at your running vLLM endpoint — run interactively (it prompts, or reads BIND_HOST:PORT from .env if present) or with the host passed directly as an argument. Run a given script on whichever workstation has that tool installed — it does not need to be the machine hosting vLLM.

Zed

Zed is the primary supported editor for this stack's AI assistant integration.

Step 1 — Run the configuration script:

bash scripts/deploy/setup-zed.sh                    # reads host/port from .env or prompts
bash scripts/deploy/setup-zed.sh 192.168.1.50:8000  # or pass the host directly

This configures ~/.config/zed/settings.json, injecting the vLLM endpoint as a custom OpenAI-compatible provider. It resolves the live model ID and context window from GET /v1/models if the server is reachable, falling back to scripts/deploy/.env's SERVED_MODEL_NAME/MAX_MODEL_LEN otherwise. Any existing settings.json is backed up first (settings.json.bak.<timestamp>), so it's safe to re-run whenever the server's model or context length changes.
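The "live value first, .env fallback second" resolution can be sketched as below. It operates on an already-parsed `/v1/models` response so it stays independent of how the request is made. The `max_model_len` field is assumed to be present in vLLM's model entry, and `resolve_model` is a hypothetical helper, not the script's own code.

```python
def resolve_model(models_payload, fallback_name, fallback_len):
    """Pick (model_id, context_len) from a /v1/models response, else use the .env fallbacks."""
    try:
        entry = models_payload["data"][0]
        model_id = entry["id"]
    except (TypeError, KeyError, IndexError):
        # Server unreachable (payload is None) or response has no models.
        return fallback_name, int(fallback_len)
    # `max_model_len` is an assumed field; fall back to the .env value if it is absent.
    return model_id, int(entry.get("max_model_len") or fallback_len)
```

Passing `None` as the payload when the request fails keeps the fallback path and the success path in one place.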

Step 2 — Set the API key placeholder:

vLLM doesn't enforce an API key, but Zed's OpenAI provider requires one to be present:

In Zed, open the Agent panel and click the model selector in the bottom-right corner → Configure... (or run the agent: settings command via Ctrl+Shift+A).
Under the OpenAI provider section, enter dummy as the API key.

Setting	Value
Provider	`openai` (custom endpoint via `api_url`)
API Base	`http://<host>:<port>/v1`
Model	Resolved live from `GET /v1/models` if reachable (typically `qwen2.5-coder-14b-awq`) — falls back to `.env`'s `SERVED_MODEL_NAME`
API Key	Not enforced by vLLM — use `dummy` in Zed's provider settings

Continue (VS Code & JetBrains)

Install the Continue extension/plugin (VS Code Marketplace or JetBrains Marketplace, for IDEs like PyCharm/IntelliJ/WebStorm/CLion), then run:

bash scripts/deploy/setup-continue.sh
# or: bash scripts/deploy/setup-continue.sh 192.168.1.50:8000

Supported IDEs

VS Code: run the script, then reload the window (Ctrl+Shift+P → Developer: Reload Window).
JetBrains IDEs (PyCharm, IntelliJ IDEA, WebStorm, CLion, etc.): run the script (Continue in JetBrains shares the same global ~/.continue/config.yaml path on Linux/macOS), then restart the IDE or click the gear icon in the Continue sidebar to refresh.

How it Works / Custom Configurations

The setup script injects the vLLM endpoint into ~/.continue/config.yaml on the machine you run it on, setting it as both the chat model and the tab-autocomplete model. It will create the file with full defaults if it doesn't exist, or patch it safely (with a backup) if it does.

Setting	Value
Provider	`openai` (OpenAI-compatible)
API Base	`http://<host>:<port>/v1`
Model	Resolved live from `GET /v1/models` if the server is reachable (typically `qwen2.5-coder-14b-awq`, matching `--served-model-name`) — falls back to `MODEL=` from `scripts/deploy/.env` with a warning if it isn't (only relevant when run on the vLLM host itself, since that's the only place `scripts/deploy/.env` exists)
Autocomplete	Same resolved model, `max_tokens=512`, `temperature=0.05`

If the server isn't reachable yet when you run this from a remote workstation, it falls back to the default HuggingFace model ID — re-run it once the server is up for an accurate config.

Aider

You can also use Aider as a command-line coding assistant powered by the vLLM instance. A setup script is provided to automate Aider installation and configuration.

bash scripts/deploy/setup-aider.sh
# or: bash scripts/deploy/setup-aider.sh 192.168.1.50:8000

This script:

Detects whether Aider is installed. If it is not, it asks for confirmation before installing it (via pipx or pip).
Resolves the vLLM server address (interactively prompting for IP and port, reading from scripts/deploy/.env, or using the command-line argument).
Safely updates or creates Aider configuration files (.aider.conf.yml at the project root or ~/.aider.conf.yml in your home directory) to use the local vLLM endpoint, specifically patching only the OpenAI-compatible API base URL, API key, and model parameters.
Generates or updates .aider.model.metadata.json alongside your Aider config to register the correct context window size (based on the server's MAX_MODEL_LEN) and token cost structures, suppressing any "Unknown context window size and costs" warnings.

Once configured, simply run:

aider
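For reference, the metadata file mentioned in the list above is a small JSON document keyed by model name. The sketch below builds one entry in the shape Aider reads for model metadata; the `openai/<model>` key prefix and the choice to set every token limit to `MAX_MODEL_LEN` are assumptions, and setup-aider.sh may write different values.

```python
import json


def aider_model_metadata(model_name, max_model_len):
    """Build a .aider.model.metadata.json entry for a local, zero-cost OpenAI-compatible model."""
    return {
        # Aider addresses OpenAI-compatible endpoints as "openai/<model>" (assumed here).
        f"openai/{model_name}": {
            "max_tokens": max_model_len,
            "max_input_tokens": max_model_len,
            "max_output_tokens": max_model_len,
            "input_cost_per_token": 0.0,
            "output_cost_per_token": 0.0,
            "litellm_provider": "openai",
            "mode": "chat",
        }
    }


print(json.dumps(aider_model_metadata("qwen2.5-coder-14b-awq", 16384), indent=2))
```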

Tuning Reference

OOM Recovery

If vLLM exits with a CUDA OOM error during initialization:

# Edit scripts/deploy/.env
MAX_MODEL_LEN=8192          # Halve the context window
GPU_MEMORY_UTILIZATION=0.80 # Lower from the 0.85 default to increase headroom

# Restart
docker compose -f deploy-artifacts/docker-compose.yml down
sudo bash scripts/deploy/deploy.sh

PCIe Bandwidth Notes

The RTX 3060 does not support NVLink. All inter-GPU communication for Tensor Parallelism goes over PCIe. A secondary slot running at x4 instead of x16 will reduce NCCL all-reduce bandwidth and may increase latency by 10–25% on large token batches.

To diagnose, run bash scripts/deploy/validate-system.sh and review the PCIe table, which flags a GPU running below its own rated Gen/Width spec. For the more directly relevant check, run bash scripts/tuning/check-bottlenecks.sh: it computes the actual effective GB/s per link and flags anything below the Gen3×8 / Gen4×4 floor that Tensor Parallelism's NCCL all-reduce needs. A card can fail that floor even while running at its own full rated spec (e.g. in a Gen2×16 slot).

Display Server Impact

Even with the smaller 14B model, a desktop environment competing for GPU 0's VRAM can push KV cache allocation into an OOM at startup. This is why the shipped defaults use GPU_MEMORY_UTILIZATION=0.85 and MAX_MODEL_LEN=16384 rather than the more aggressive 0.90/32768 the hardware can otherwise support. If GPU 0 shows >800 MiB idle usage, free it before deploying:

# Free GPU 0 before deployment (non-destructive, re-enable with graphical.target)
sudo systemctl isolate multi-user.target

# Re-enable desktop when done
sudo systemctl isolate graphical.target

If GPU 0 is already headless (no idle usage), you can raise GPU_MEMORY_UTILIZATION back to 0.90 and MAX_MODEL_LEN to 32768 in scripts/deploy/.env for the full context window.
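To see why doubling the context window matters on 12 GB cards, it helps to estimate the KV cache for one full-length sequence. The sketch below is a back-of-the-envelope calculation, not something the repo ships: the layer count, KV-head count, and head dimension are assumed values for Qwen2.5-14B that should be checked against the model's config.json, and it assumes the default unquantized (16-bit) KV cache.

```python
# Architecture values assumed for Qwen2.5-14B; confirm against the model's config.json.
NUM_LAYERS = 48
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumes a 16-bit KV cache; AWQ quantizes weights, not the cache


def kv_cache_gib(context_tokens, tensor_parallel_size=1):
    """GiB of KV cache for one full-length sequence, per GPU when sharded across TP ranks."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # keys + values
    return context_tokens * per_token / tensor_parallel_size / 2**30


for ctx in (16384, 32768):
    print(ctx, kv_cache_gib(ctx), kv_cache_gib(ctx, tensor_parallel_size=2))
```

Under these assumptions a single 32,768-token sequence needs twice the cache of a 16,384-token one on each GPU, on top of the model weights, which is the headroom a display server on GPU 0 can eat into.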

Server Management

# Stop the server
docker compose -f deploy-artifacts/docker-compose.yml down
# or: bash scripts/deploy/teardown.sh   (same thing, plus a post-stop VRAM confirmation table)

# Capture a diagnostic snapshot before stopping (GPU state + recent logs) —
# useful when debugging a crash, OOM, or silent hang. Read-only, never
# modifies GPU/container state.
bash scripts/tuning/snapshot-diagnostics.sh [log_lines]   # default: last 50 log lines
# Saved to ~/.local/share/vllm-snapshots/snapshot_<timestamp>.txt

# View live logs
docker compose -f deploy-artifacts/docker-compose.yml logs -f

# Restart after a config change
docker compose -f deploy-artifacts/docker-compose.yml down
sudo bash scripts/deploy/deploy.sh   # Re-tunes and regenerates the override file

# Check container health
docker inspect --format='{{.State.Health.Status}}' vllm-coder-server

Open WebUI Support (Optional)

This repository includes optional support for Open WebUI, allowing you to interact with the hosted vLLM model through a ChatGPT-like browser interface.

Key Features

Browser Access: Use the model from any device (phone, laptop, tablet) on your local network.
Optional & Disabled by Default: Off unless you set ENABLE_OPEN_WEBUI=true (the default is false), to preserve the core vLLM-only focus.
Data Persistence: Open WebUI's database, user accounts, and chat history are saved in a persistent Docker volume, preserving your data across container restarts, redeployments, and normal teardown.sh operations.
Auto-Boot: Starts automatically at host reboot alongside vLLM when enabled.

Configuration

All Open WebUI settings live in your scripts/deploy/.env file. Copy these values from scripts/deploy/.env.example if they are not already in your configuration:

# Enable Open WebUI deployment (true/false)
ENABLE_OPEN_WEBUI=true

# Port on the host network where Open WebUI will listen
OPEN_WEBUI_PORT=3000

# Subnet CIDR of your private LAN to restrict firewall access (optional)
# Example: LAN_CIDR=10.1.10.0/24
LAN_CIDR=10.1.10.0/24

Deployment

Simply set ENABLE_OPEN_WEBUI=true in scripts/deploy/.env and run the deployment script:

sudo bash scripts/deploy/deploy.sh

The script will automatically detect that Open WebUI is enabled, perform port availability checks, open UFW/firewalld rules (restricted to LAN_CIDR if set), launch the container stack, and validate that both vLLM and Open WebUI are running and configured with their restart policies.

After successful deployment, the script outputs the connection URLs:

vLLM API      : http://10.1.10.17:8000/v1
Open WebUI    : http://10.1.10.17:3000

Note

On the first run of Open WebUI, you will need to sign up to create the admin account. This account is entirely local and does not send any data outside your network. Since this setup is intended for a trusted home LAN, there is no TLS or external authentication configured by default.

Verification & Smoke Testing

To verify both services are running and accessible from either the host itself or another LAN client:

# Run the validation smoke test
bash scripts/deploy/smoke-test.sh

You can also pass overrides to verify connectivity from another machine on your LAN:

# Usage: bash scripts/deploy/smoke-test.sh [host-ip] [vllm-port] [open-webui-port] [enable-webui]
bash scripts/deploy/smoke-test.sh 10.1.10.17 8000 3000 true

The smoke test validates:

vLLM /v1/models endpoint responds successfully.
Open WebUI HTTP endpoint responds on the configured port.
Both containers are configured with unless-stopped (or your configured) restart policies.

Teardown and Data Preservation

To stop the services and release GPU VRAM:

bash scripts/deploy/teardown.sh

This stops both vLLM and Open WebUI containers. Your Open WebUI chat history, user accounts, and settings are preserved.

To perform a deep-clean and delete all Open WebUI data/volumes, pass the --purge flag:

# WARNING: This deletes the Open WebUI database/volume permanently!
bash scripts/deploy/teardown.sh --purge

Troubleshooting

Container fails to start or port already in use: The deployment script checks port availability and fails fast. If the port is in use, find the owning process with ss -tlnp (as root) or configure a different OPEN_WEBUI_PORT in scripts/deploy/.env.
Cannot reach Open WebUI from another LAN host: Ensure OPEN_WEBUI_HOST is set to 0.0.0.0 (all interfaces) in scripts/deploy/.env. Verify that the firewall (UFW/firewalld) is allowing the port and that LAN_CIDR matches your client's subnet.
Open WebUI cannot reach vLLM: Open WebUI connects to vLLM inside the Docker network. Ensure OPEN_WEBUI_OPENAI_API_BASE_URL in scripts/deploy/.env points to http://vllm:8000/v1 (using the container service name vllm rather than localhost).
Services not starting after reboot: Check if the Docker service is enabled to start at boot (systemctl is-enabled docker). Verify the restart policies in scripts/deploy/.env (VLLM_RESTART_POLICY and OPEN_WEBUI_RESTART_POLICY) are set to unless-stopped or always.

Security Notes

The server binds to 0.0.0.0 on all interfaces by default. If running on a network-accessible machine, set BIND_HOST=127.0.0.1 in scripts/deploy/.env to restrict it to localhost, or add a firewall rule.
If BIND_HOST is set to a LAN address, deploy.sh opens the API port on ufw/firewalld automatically (only if one of them is active — it never installs or enables a firewall for you). It only ever opens the single PORT from scripts/deploy/.env, never a broad range.
vLLM does not enforce API key authentication by default. Add --api-key <secret> to the command in docker-compose.yml (or via docker-compose.override.yml) to enable it.
The HuggingFace cache is mounted from the host via HF_CACHE_DIR. Ensure the model cache directory has appropriate permissions.

Development & Code Quality

This repository uses pre-commit to automate code validation and enforce security best practices before any changes are committed.

Installed Hooks

Syntax linting & formatting

trailing-whitespace — trims trailing whitespace from files
end-of-file-fixer — ensures files end with a newline
check-yaml — validates YAML syntax (e.g. deploy-artifacts/docker-compose.yml, .pre-commit-config.yaml)
check-json — validates JSON syntax
check-added-large-files — blocks accidentally committing large files (e.g. model weights, cached tensors)
shellcheck — runs ShellCheck on all shell scripts in scripts/

Security & secret detection

detect-private-key: checks for the presence of private keys
detect-secrets: scans staged changes for hardcoded secrets, API keys, or credentials (no account/registration required, unlike some hosted secret-scanning services). Known false positives (e.g. the literal placeholder api_key="dummy", used because vLLM doesn't enforce API keys by default) are tracked in .secrets.baseline. If you intentionally add a new one, regenerate the baseline with detect-secrets scan > .secrets.baseline, then mark the entry as a false positive with detect-secrets audit .secrets.baseline.
Custom commit message scanner: git-hooks/check-commit-msg-secrets.py scans Git commit messages for secrets (e.g. AWS keys, Slack tokens, high-entropy API keys) during the commit-msg hook phase
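As an illustration of what a commit-message scanner of this kind typically does, the sketch below combines two well-known token patterns with a Shannon-entropy check for long random-looking strings. It is not the contents of git-hooks/check-commit-msg-secrets.py: the patterns, thresholds, and function names are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; the hook's real rule set is not shown in this README.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "slack_token": re.compile(r"\bxox[abprs]-[0-9A-Za-z-]{10,}\b"),
}


def shannon_entropy(text):
    """Bits per character; long random-looking tokens score high."""
    counts = Counter(text)
    return -sum(n / len(text) * math.log2(n / len(text)) for n in counts.values())


def find_secrets(message, entropy_threshold=4.0, min_len=24):
    """Return sorted names of the rules that a commit message trips."""
    hits = {name for name, rx in PATTERNS.items() if rx.search(message)}
    for word in re.findall(r"[A-Za-z0-9+/=_-]+", message):
        if len(word) >= min_len and shannon_entropy(word) >= entropy_threshold:
            hits.add("high_entropy_string")
    return sorted(hits)
```

A commit-msg hook built on this would read the message file path from its first argument and exit non-zero when `find_secrets` returns anything, which aborts the commit.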

Manual Verification

Run every pre-commit check against all files at any time:

pre-commit run --all-files

License

MIT
