[fix] allow exporting custom Metal kernels without a Metal backend #3650

Open
xthomaswang wants to merge 1 commit into ml-explore:main from xthomaswang:fix/no-metal-custom-kernel-export

Conversation

@xthomaswang


Proposed changes

Fixes #3240

mx.fast.metal_kernel throws at construction on non-Metal builds, so functions
using custom Metal kernels can't be exported from Linux.

  • Move metal_kernel graph construction to backend/common so it builds on all platforms
  • During export tracing without Metal, record the kernel on a placeholder GPU stream which the importing process remaps to a real stream
  • Outside of export tracing, calling the kernel without Metal still raises an error
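The three rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: `Stream`, `DeviceType`, `kPlaceholderIndex`, `kernel_stream`, `remap_on_import`, and the error text are hypothetical stand-ins, not the types or functions used in this PR.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for illustration; these are not the MLX types.
enum class DeviceType { cpu, gpu };

struct Stream {
  int index;
  DeviceType device;
  bool placeholder;  // true when no real GPU queue backs this stream
};

// Assumed sentinel index for the placeholder stream.
constexpr int kPlaceholderIndex = -1;

// Chooses the stream a custom kernel is recorded on.
//   metal_available: whether this build has a Metal backend
//   export_tracing:  whether the call happens while tracing for export
Stream kernel_stream(bool metal_available, bool export_tracing) {
  if (metal_available) {
    // Normal path: a real GPU stream exists.
    return Stream{0, DeviceType::gpu, false};
  }
  if (export_tracing) {
    // No Metal, but the graph is only being recorded, never evaluated.
    return Stream{kPlaceholderIndex, DeviceType::gpu, true};
  }
  // No Metal and not exporting: the kernel could never run here.
  throw std::runtime_error("[metal_kernel] Metal backend is not available.");
}

// Import side: swap a recorded placeholder for a real GPU stream.
Stream remap_on_import(const Stream& recorded, const Stream& real_gpu) {
  return recorded.placeholder ? real_gpu : recorded;
}
```

The point of the placeholder is that the exporting process never evaluates the kernel, so it only needs something GPU-shaped to attach to the graph; the importing process, which does have a GPU, substitutes its own stream.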

The same mechanism could support exporting mx.fast.cuda_kernel from non-CUDA
platforms; happy to do that as a follow-up.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@xthomaswang

Author

This now conflicts with #3638, which made the tracing state thread-local. My plan is to rebase on main and make the new InExportTracing counter thread-local as well, to match the rest of the tracing state. Let me know if that sounds good, or if you would suggest a different approach.
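For readers unfamiliar with the pattern, a thread-local tracing counter is commonly paired with an RAII guard, roughly as below. This is a minimal sketch, not the PR's code: the `sketch` namespace, `export_tracing_depth`, `ExportTracingGuard`, and `in_export_tracing` are hypothetical names, and the actual InExportTracing implementation may look different.

```cpp
// Hypothetical names for illustration; the PR's InExportTracing may differ.
namespace sketch {

// One counter per thread: a trace running on one thread does not
// change what any other thread observes.
inline int& export_tracing_depth() {
  thread_local int depth = 0;
  return depth;
}

// RAII guard: increments on construction and decrements on destruction.
// Using a counter rather than a bool keeps nested traces correct.
class ExportTracingGuard {
 public:
  ExportTracingGuard() { ++export_tracing_depth(); }
  ~ExportTracingGuard() { --export_tracing_depth(); }
  ExportTracingGuard(const ExportTracingGuard&) = delete;
  ExportTracingGuard& operator=(const ExportTracingGuard&) = delete;
};

// True while at least one guard is alive on the calling thread.
inline bool in_export_tracing() {
  return export_tracing_depth() > 0;
}

}  // namespace sketch
```

With a plain global counter, a trace on one thread would make unrelated threads believe they are exporting too; `thread_local` storage avoids that without any locking.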

@zcbenz

zcbenz commented Jun 14, 2026

Collaborator

This basically looks good to me; a rebase and changing the counter to thread-local would be enough.

- Move metal_kernel graph construction to backend/common so it builds on
  all platforms
- During export tracing without Metal, record the kernel on a placeholder
  GPU stream which the importing process remaps to a real stream
- Outside of export tracing, calling the kernel without Metal still raises
  an error

Fixes ml-explore#3240
@xthomaswang xthomaswang force-pushed the fix/no-metal-custom-kernel-export branch from a431ede to 7cbaeac Compare June 14, 2026 02:00
@xthomaswang

Author

> This basically looks good to me; a rebase and changing the counter to thread-local would be enough.

Done: rebased on main and changed the counter to thread-local.

if ((device && *device != Device::gpu) ||
    (stream && stream->device != Device::gpu) ||
    (tl_stream && tl_stream->device != Device::gpu)) {
  throw std::invalid_argument("[metal_kernel] Only supports the GPU.");

Collaborator

Is it necessary to require a GPU stream here? Can we just return to_stream(s) when exporting?

Development

Successfully merging this pull request may close these issues.

[Feature Request] Fallback for custom kernels

2 participants