ci: add Linux x86-64 CUDA release builds on GitHub-hosted runners by achimnol · Pull Request #208 · lablup/mlxcel

achimnol · 2026-06-10T13:35:08Z

Summary

Adds a Linux x86-64 release variant so builds can be produced and verified beyond the current macOS aarch64 / Linux aarch64 (GB10, GH200) targets.

- New build-linux-x86-cuda release job on GitHub-hosted ubuntu-24.04 runners, producing three CUDA 13 assets:
  - mlxcel-linux-x86_64-cuda13-sm80 (A100)
  - mlxcel-linux-x86_64-cuda13-sm90a (H100/H200; 90a gates MLX's qmm_sm90 kernel, same as the gh200 asset)
  - mlxcel-linux-x86_64-cuda13-sm120 (RTX 50-series / RTX PRO Blackwell)
- No GPU runner needed: compiling CUDA only requires the toolkit (MLX upstream CI builds its CUDA wheels on GPU-less runners the same way), and the release workflow intentionally runs no tests. Runtime verification happens out-of-band on real hardware.
- MLXCEL_CXX_MARCH build.rs override: the bridge C++ previously hardcoded -march=native, which is correct for per-machine assets (gb10/gh200) but would bake the hosted runner's ISA (possibly AVX-512) into a redistributable binary, which would then crash with SIGILL on older CPUs. The x86-64 assets pin x86-64-v3 (AVX2) for both the bridge and rustc. Default behavior is unchanged (native).
- Parallelism cap of -j 3 on the 16 GB hosted runners: the CUTLASS-heavy qmm_*.cu kernels peak at ~4-5 GB of cicc memory per parallel job, so the default -j$(nproc) OOMs (observed locally as gmake Error 137).
- notify-teams and promote-release now also gate on the new job.
- Docs: MLXCEL_CXX_MARCH reference, x86-64 build prerequisites, and troubleshooting entries for the qmm OOM and the liblapacke-dev requirement.
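
The MLXCEL_CXX_MARCH override described above can be pictured roughly as follows. This is a minimal sketch, not the build.rs from this PR: the helper names cxx_march_flag and cxx_march_flag_from_env are hypothetical, and treating an empty value as "unset" is an assumption.

```rust
/// Hypothetical helper: map an optional MLXCEL_CXX_MARCH value to a compiler flag.
/// Assumption: an unset or empty value falls back to the previous default, native.
fn cxx_march_flag(override_value: Option<&str>) -> String {
    match override_value.map(str::trim) {
        Some(v) if !v.is_empty() => format!("-march={}", v),
        _ => String::from("-march=native"),
    }
}

/// Hypothetical wrapper showing how a build script could read the variable.
fn cxx_march_flag_from_env() -> String {
    // Ask Cargo to re-run the build script whenever the variable changes.
    println!("cargo:rerun-if-env-changed=MLXCEL_CXX_MARCH");
    let value = std::env::var("MLXCEL_CXX_MARCH").ok();
    cxx_march_flag(value.as_deref())
}
```

Under these assumptions, a per-machine build that leaves the variable unset keeps -march=native, while a redistributable build that sets MLXCEL_CXX_MARCH=x86-64-v3 gets -march=x86-64-v3.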

Codebase review notes (why no other code changes were needed)

- mlx_cxx_bridge.h includes mlx/mlx.h, which declares mlx::core::metal::* on every platform; MLX's no_metal.cpp provides no-op stubs on non-Metal builds, so the Metal capture helpers compile and link on Linux as-is (already proven by the aarch64 CUDA assets).
- The turbo Metal-JIT kernels (fast::metal_kernel) throw on non-Metal backends, but only behind the opt-in Turbo4Asym cache modes. This is the same exposure as on the existing aarch64 Linux assets, not an x86-64 regression (tracked separately as Linux hardening).
- All platform gates in the tree are OS-conditional (__APPLE__, target_os), none arch-conditional; the CUDA patches runtime-dispatch Sm75/Sm80/Sm90 CUTLASS paths.
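
To illustrate why OS-only gates carry over to x86-64 without changes, here is a small sketch of such a gate. The function gpu_backend and its return strings are hypothetical and are not taken from the mlxcel tree; only the use of target_os rather than target_arch reflects the note above.

```rust
/// Hypothetical OS-conditional gate: the choice depends on target_os only,
/// so Linux aarch64 and Linux x86-64 take the same branch.
fn gpu_backend() -> &'static str {
    if cfg!(target_os = "macos") {
        "metal"
    } else {
        "cuda"
    }
}
```

Because nothing in the condition mentions target_arch, adding a new CPU architecture on an already supported OS does not require touching a gate of this shape.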

Local verification (x86-64, Ubuntu 26.04, CUDA 13.3, RTX 5090)

- cargo build --release --features cuda --locked succeeds; MLX_CUDA_ARCHITECTURES auto-detection correctly chose 120a.
- Prerequisites beyond the toolkit: libcudnn9-dev-cuda-13, libopenblas-dev, liblapacke-dev (CMake fails with LAPACK_INCLUDE_DIRS NOTFOUND without lapacke.h).
- Smoke test: mlxcel generate -m Qwen3-0.6B-4bit -p "Hello" -n 16 → Runtime device: NVIDIA GPU (CUDA), 9 tokens generated.

Verifying the workflow

A tag-less workflow_dispatch run with targets=linux is the documented build-only mode: it exercises the new job (and the aarch64 jobs) without uploading assets. The new job skips upload when release_tag is empty.
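
The build-only behaviour can be modelled as a small decision on release_tag. The real logic lives in the release workflow, not in Rust; the RunMode enum and run_mode function below are hypothetical names used only to make the rule explicit, and trimming whitespace is an assumption.

```rust
/// Hypothetical model of the two workflow modes.
#[derive(Debug, PartialEq)]
enum RunMode {
    /// No tag given: compile everything, upload nothing.
    BuildOnly,
    /// Tag given: compile and attach assets to that release.
    BuildAndUpload { tag: String },
}

/// Decide the mode from the release_tag input (assumption: whitespace is ignored).
fn run_mode(release_tag: &str) -> RunMode {
    let tag = release_tag.trim();
    if tag.is_empty() {
        RunMode::BuildOnly
    } else {
        RunMode::BuildAndUpload { tag: tag.to_string() }
    }
}
```

In this model, a workflow_dispatch run without a tag maps to BuildOnly, which matches the description of exercising the jobs without uploading assets.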

Out of scope / follow-ups

- cargo/cmake caching for the hosted job (the first iteration builds cold, which fits roughly within the 300 min timeout).
- Gating the turbo4 sparse-V kernel path on Metal availability (pre-existing on all Linux builds).
- CPU-only x86-64 asset (the debian/PPA packaging already covers that path).

Add a build-linux-x86-cuda release job that produces x86-64 Linux CUDA 13 assets for SM 80 (A100), SM 90a (H100/H200), and SM 120 (RTX 50-series / RTX PRO Blackwell). Unlike the self-hosted aarch64 jobs, these run on standard GitHub-hosted ubuntu-24.04 runners: compiling CUDA only needs the toolkit (nvcc compiles without a device; MLX upstream CI builds its CUDA wheels the same way), and the release workflow intentionally runs no tests.

Redistributable binaries cannot inherit the build host's ISA, so the bridge C++'s unconditional -march=native gains an MLXCEL_CXX_MARCH override (default unchanged: native); the x86-64 assets pin x86-64-v3 (AVX2) for both the bridge and rustc.

Build parallelism is capped at -j 3 on the 16 GB hosted runners because the CUTLASS-heavy qmm_*.cu kernels peak at ~4-5 GB of cicc memory per job (observed OOM at default -j).

Verified locally on x86-64 Ubuntu 26.04 + CUDA 13.3 + RTX 5090 (sm_120a auto-detected): cargo build --release --features cuda --locked succeeds and mlxcel generate produces tokens on the GPU.

Build prerequisites beyond the CUDA toolkit (libcudnn9-dev-cuda-13, libopenblas-dev, and liblapacke-dev, which provides lapacke.h; liblapack-dev alone is not enough) and the qmm OOM mitigation are now documented in docs/installation.md.
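
The -j 3 cap follows from simple arithmetic on the figures quoted above. The sketch below is a back-of-the-envelope model, not code from the repository: max_parallel_jobs is a hypothetical helper, and it assumes every job hits its peak at the same time while ignoring memory used by the OS and other processes.

```rust
/// Hypothetical estimate: the largest job count whose combined worst-case
/// memory fits in `total_mem_gb`, never going below one job.
fn max_parallel_jobs(total_mem_gb: u32, peak_per_job_gb: u32) -> u32 {
    assert!(peak_per_job_gb > 0, "per-job peak must be positive");
    (total_mem_gb / peak_per_job_gb).max(1)
}
```

With the upper end of the estimate (5 GB per job), a 16 GB runner fits 3 jobs (15 GB). With the lower end (4 GB), 4 jobs would consume all 16 GB and leave nothing for the rest of the system, so 3 is the conservative choice under this model.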

achimnol wants to merge 1 commit into main from topic/linux-x86-64-release
