ci: add Linux x86-64 CUDA release builds on GitHub-hosted runners #208

Open
achimnol wants to merge 1 commit into main from topic/linux-x86-64-release

@achimnol

Member

Summary

Adds a Linux x86-64 release variant so builds can be produced and verified beyond the current macOS aarch64 / Linux aarch64 (GB10, GH200) targets.

  • New build-linux-x86-cuda release job on GitHub-hosted ubuntu-24.04 runners, producing three CUDA 13 assets:
    • mlxcel-linux-x86_64-cuda13-sm80 (A100)
    • mlxcel-linux-x86_64-cuda13-sm90a (H100/H200; 90a gates MLX's qmm_sm90 kernel, same as the gh200 asset)
    • mlxcel-linux-x86_64-cuda13-sm120 (RTX 50-series / RTX PRO Blackwell)
  • No GPU runner needed: compiling CUDA only requires the toolkit (MLX upstream CI builds its CUDA wheels on GPU-less runners the same way), and the release workflow intentionally runs no tests. Runtime verification happens out-of-band on real hardware.
  • MLXCEL_CXX_MARCH build.rs override: the bridge C++ previously hardcoded -march=native, which is correct for per-machine assets (gb10/gh200) but would bake the hosted runner's ISA (possibly AVX-512) into a redistributable binary and SIGILL on older CPUs. The x86-64 assets pin x86-64-v3 (AVX2) for both the bridge and rustc. Default behavior is unchanged (native).
  • -j 3 parallelism cap on the 16 GB hosted runners: the CUTLASS-heavy qmm_*.cu kernels peak at ~4-5 GB of cicc memory per parallel job; default -j$(nproc) OOMs (observed locally as gmake Error 137).
  • notify-teams and promote-release now also gate on the new job.
  • Docs: MLXCEL_CXX_MARCH reference, x86-64 build prerequisites, and troubleshooting entries for the qmm OOM and the liblapacke-dev requirement.
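The MLXCEL_CXX_MARCH override described above can be sketched as follows. This is a minimal illustration, not the actual build.rs from this PR: the function names `march_flag` and `march_flag_from_env` are hypothetical, and treating an empty or whitespace-only value as "unset" is an assumption. Only the variable name, the `native` default, and the `x86-64-v3` value come from the description.

```rust
use std::env;

/// Hypothetical helper: turn an optional MLXCEL_CXX_MARCH value into the
/// compiler flag passed to the bridge C++ build.
/// Unset (or, by assumption, empty) keeps the previous default: native.
fn march_flag(override_value: Option<&str>) -> String {
    let arch = match override_value.map(str::trim) {
        Some(v) if !v.is_empty() => v,
        _ => "native",
    };
    format!("-march={}", arch)
}

/// Hypothetical helper for use from a build script: reads the environment
/// and tells Cargo to rerun the script when the override changes.
fn march_flag_from_env() -> String {
    println!("cargo:rerun-if-env-changed=MLXCEL_CXX_MARCH");
    march_flag(env::var("MLXCEL_CXX_MARCH").ok().as_deref())
}
```

With this shape, a per-machine build that leaves the variable unset still gets `-march=native`, while the release job would export `MLXCEL_CXX_MARCH=x86-64-v3` to get a redistributable AVX2 baseline.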

Codebase review notes (why no other code changes were needed)

  • mlx_cxx_bridge.h includes mlx/mlx.h, which declares mlx::core::metal::* on every platform; MLX's no_metal.cpp provides no-op stubs on non-Metal builds, so the Metal capture helpers compile and link on Linux as-is (already proven by the aarch64 CUDA assets).
  • The turbo Metal-JIT kernels (fast::metal_kernel) throw on non-Metal backends, but only behind the opt-in Turbo4Asym cache modes — identical exposure to the existing aarch64 Linux assets; not an x86-64 regression (tracked separately as Linux hardening).
  • All platform gates in the tree are OS-conditional (__APPLE__, target_os), none arch-conditional; the CUDA patches runtime-dispatch Sm75/Sm80/Sm90 CUTLASS paths.

Local verification (x86-64, Ubuntu 26.04, CUDA 13.3, RTX 5090)

  • cargo build --release --features cuda --locked succeeds; MLX_CUDA_ARCHITECTURES auto-detect correctly chose 120a.
  • Prerequisites beyond the toolkit: libcudnn9-dev-cuda-13, libopenblas-dev, liblapacke-dev (CMake fails with LAPACK_INCLUDE_DIRS NOTFOUND without lapacke.h).
  • Smoke test: mlxcel generate -m Qwen3-0.6B-4bit -p "Hello" -n 16 → Runtime device: NVIDIA GPU (CUDA), 9 tokens generated.

Verifying the workflow

A tag-less workflow_dispatch run with targets=linux is the documented build-only mode: it exercises the new job (and the aarch64 jobs) without uploading assets. The new job skips upload when release_tag is empty.

Out of scope / follow-ups

  • cargo/cmake caching for the hosted job (the first iteration builds cold and is expected to fit within the 300 min timeout).
  • Gating the turbo4 sparse-V kernel path on Metal availability (pre-existing on all Linux builds).
  • CPU-only x86-64 asset (the debian/PPA packaging already covers that path).

Add a build-linux-x86-cuda release job that produces x86-64 Linux CUDA 13
assets for SM 80 (A100), SM 90a (H100/H200), and SM 120 (RTX 50-series /
RTX PRO Blackwell). Unlike the self-hosted aarch64 jobs, these run on
standard GitHub-hosted ubuntu-24.04 runners: compiling CUDA only needs the
toolkit (nvcc compiles without a device; MLX upstream CI builds its CUDA
wheels the same way), and the release workflow intentionally runs no tests.

Redistributable binaries cannot inherit the build host's ISA, so the
bridge C++'s unconditional -march=native gains an MLXCEL_CXX_MARCH
override (default unchanged: native); the x86-64 assets pin x86-64-v3
(AVX2) for both the bridge and rustc. Build parallelism is capped at -j 3
on the 16 GB hosted runners because the CUTLASS-heavy qmm_*.cu kernels
peak at ~4-5 GB of cicc memory per job (observed OOM at default -j).
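The -j 3 cap is consistent with a simple memory budget: taking the upper end of the ~4-5 GB estimate, 16 GB / 5 GB rounds down to 3 jobs. The sketch below only illustrates that arithmetic; the function `capped_jobs` is hypothetical and is not how the workflow derives the value (the workflow uses a fixed -j 3).

```rust
/// Hypothetical helper: choose a parallel job count that stays within RAM.
/// `peak_per_job_gib` is the worst-case compiler memory per job; the result
/// is never larger than the core count and never smaller than 1.
fn capped_jobs(cores: usize, total_mem_gib: u64, peak_per_job_gib: u64) -> usize {
    // Guard against a zero divisor, then take the integer quotient.
    let by_memory = (total_mem_gib / peak_per_job_gib.max(1)) as usize;
    by_memory.min(cores).max(1)
}
```

For example, `capped_jobs(4, 16, 5)` yields 3, where the core count of 4 is an assumed value for the hosted runner rather than something stated in this PR.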

Verified locally on x86-64 Ubuntu 26.04 + CUDA 13.3 + RTX 5090 (sm_120a
auto-detected): cargo build --release --features cuda --locked succeeds
and mlxcel generate produces tokens on the GPU. Build prerequisites
beyond the CUDA toolkit — libcudnn9-dev-cuda-13, libopenblas-dev,
liblapacke-dev (lapacke.h; liblapack-dev alone is not enough) — and the
qmm OOM mitigation are now documented in docs/installation.md.