ci: add Linux x86-64 CUDA release builds on GitHub-hosted runners #208
Open
achimnol wants to merge 1 commit into
Add a build-linux-x86-cuda release job that produces x86-64 Linux CUDA 13 assets for SM 80 (A100), SM 90a (H100/H200), and SM 120 (RTX 50-series / RTX PRO Blackwell).

Unlike the self-hosted aarch64 jobs, these run on standard GitHub-hosted ubuntu-24.04 runners: compiling CUDA only needs the toolkit (nvcc compiles without a device; MLX upstream CI builds its CUDA wheels the same way), and the release workflow intentionally runs no tests.

Redistributable binaries cannot inherit the build host's ISA, so the bridge C++'s unconditional -march=native gains an MLXCEL_CXX_MARCH override (default unchanged: native); the x86-64 assets pin x86-64-v3 (AVX2) for both the bridge and rustc.

Build parallelism is capped at -j 3 on the 16 GB hosted runners because the CUTLASS-heavy qmm_*.cu kernels peak at ~4-5 GB of cicc memory per job (observed OOM at default -j).

Verified locally on x86-64 Ubuntu 26.04 + CUDA 13.3 + RTX 5090 (sm_120a auto-detected): cargo build --release --features cuda --locked succeeds and mlxcel generate produces tokens on the GPU.

Build prerequisites beyond the CUDA toolkit (libcudnn9-dev-cuda-13, libopenblas-dev, and liblapacke-dev for lapacke.h; liblapack-dev alone is not enough) and the qmm OOM mitigation are now documented in docs/installation.md.
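The MLXCEL_CXX_MARCH override described above can be illustrated with a minimal Rust sketch of how a build script might resolve the flag. This is not the PR's actual build.rs: the helper names `march_flag` and `march_flag_from_env` are hypothetical, and treating a blank value as unset is an assumption made for the example.

```rust
use std::env;

/// Fallback when MLXCEL_CXX_MARCH is unset, matching the stated default (native).
const DEFAULT_MARCH: &str = "native";

/// Build the -march flag for the bridge C++ from an optional override.
/// Hypothetical helper; a blank override is treated as unset (an assumption).
fn march_flag(override_value: Option<&str>) -> String {
    let march = override_value
        .map(str::trim)
        .filter(|value| !value.is_empty())
        .unwrap_or(DEFAULT_MARCH);
    format!("-march={march}")
}

/// Resolve the flag from the environment, as a build script would.
fn march_flag_from_env() -> String {
    march_flag(env::var("MLXCEL_CXX_MARCH").ok().as_deref())
}
```

With this shape, a release job that exports MLXCEL_CXX_MARCH=x86-64-v3 gets -march=x86-64-v3, while a local build with the variable unset keeps -march=native. A build script that reads an environment variable would usually also print `cargo:rerun-if-env-changed=MLXCEL_CXX_MARCH` so Cargo reruns it when the value changes; whether the PR's build.rs does so is not stated here.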
Summary
Adds a Linux x86-64 release variant so builds can be produced and verified beyond the current macOS aarch64 / Linux aarch64 (GB10, GH200) targets.
- `build-linux-x86-cuda` release job on GitHub-hosted `ubuntu-24.04` runners, producing three CUDA 13 assets:
  - `mlxcel-linux-x86_64-cuda13-sm80` (A100)
  - `mlxcel-linux-x86_64-cuda13-sm90a` (H100/H200; `90a` gates MLX's `qmm_sm90` kernel, same as the gh200 asset)
  - `mlxcel-linux-x86_64-cuda13-sm120` (RTX 50-series / RTX PRO Blackwell)
- `MLXCEL_CXX_MARCH` build.rs override: the bridge C++ previously hardcoded `-march=native`, which is correct for per-machine assets (gb10/gh200) but would bake the hosted runner's ISA (possibly AVX-512) into a redistributable binary and SIGILL on older CPUs. The x86-64 assets pin `x86-64-v3` (AVX2) for both the bridge and rustc. Default behavior is unchanged (`native`).
- `-j 3` parallelism cap on the 16 GB hosted runners: the CUTLASS-heavy `qmm_*.cu` kernels peak at ~4-5 GB of `cicc` memory per parallel job; default `-j$(nproc)` OOMs (observed locally as `gmake Error 137`).
- `notify-teams` and `promote-release` now also gate on the new job.
- Docs: `MLXCEL_CXX_MARCH` reference, x86-64 build prerequisites, and troubleshooting entries for the qmm OOM and the `liblapacke-dev` requirement.

Codebase review notes (why no other code changes were needed)
- `mlx_cxx_bridge.h` includes `mlx/mlx.h`, which declares `mlx::core::metal::*` on every platform; MLX's `no_metal.cpp` provides no-op stubs on non-Metal builds, so the Metal capture helpers compile and link on Linux as-is (already proven by the aarch64 CUDA assets).
- Paths that use `fast::metal_kernel` throw on non-Metal backends, but only behind the opt-in `Turbo4Asym` cache modes. This is the same exposure as the existing aarch64 Linux assets, not an x86-64 regression (tracked separately as Linux hardening).
- Platform conditionals are OS-based (`__APPLE__`, `target_os`), none arch-conditional; the CUDA patches runtime-dispatch Sm75/Sm80/Sm90 CUTLASS paths.

Local verification (x86-64, Ubuntu 26.04, CUDA 13.3, RTX 5090)
- `cargo build --release --features cuda --locked` succeeds; `MLX_CUDA_ARCHITECTURES` auto-detect correctly chose `120a`.
- Build prerequisites beyond the CUDA toolkit: `libcudnn9-dev-cuda-13`, `libopenblas-dev`, `liblapacke-dev` (CMake fails with `LAPACK_INCLUDE_DIRS NOTFOUND` without `lapacke.h`).
- `mlxcel generate -m Qwen3-0.6B-4bit -p "Hello" -n 16` → `Runtime device: NVIDIA GPU (CUDA)`, 9 tokens generated.

Verifying the workflow
A tag-less `workflow_dispatch` run with `targets=linux` is the documented build-only mode: it exercises the new job (and the aarch64 jobs) without uploading assets. The new job skips upload when `release_tag` is empty.

Out of scope / follow-ups