Skip to content

Draft multigpu ci for github from codex#1790

Open
coreyjadams wants to merge 1 commit into
mainfrom
multigpu-ci
Open

Draft multigpu ci for github from codex#1790
coreyjadams wants to merge 1 commit into
mainfrom
multigpu-ci

Conversation

@coreyjadams

Copy link
Copy Markdown
Collaborator

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@copy-pr-bot

copy-pr-bot Bot commented Jul 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coreyjadams

Copy link
Copy Markdown
Collaborator Author

/ok to test 0341d7e

@greptile-apps

greptile-apps Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a complete multi-GPU CI system for GitHub Actions, including a Python script (multigpu_ci.py) for impact-based test selection, a composite action for environment setup and test execution, a workflow with separate dynamic and static GPU streams, impact-manifest publisher jobs, and a merge-queue passthrough entry for the two new status checks.

  • Impact-gating logic in multigpu_ci.py compares PR file changes against nightly-cached manifests (coverage + JUnit data) and fails open on any uncertainty — missing manifests, stale data, API errors, or PR snapshot races all run the affected stream rather than silently skipping it.
  • Manifest validation is strict: schema version, stream, completeness flag, source SHA, timestamp, rank/runner-profile match, JUnit totals, and the coverage-to-test-file union must all pass before a manifest is accepted.
  • CACHE_CONTRACT.md is updated to document the two new multigpu-impact-{stream}-v1-latest mutable cache slots and their save/restore semantics.

Important Files Changed

Filename Overview
.github/scripts/multigpu_ci.py New 1,367-line Python script for multi-GPU CI utilities: AST-based marker discovery, manifest building/validation, PR impact gating, and GPU verification. Well-structured with fail-open semantics. One dead-code line after os.execv in _command_run_tests.
.github/workflows/github-multigpu.yml New workflow with CPU selector job followed by dynamic and static GPU jobs; publisher jobs run only on schedule/approved default-branch dispatches. Concurrency groups, fail-open conditions, and permission scopes look correct.
.github/actions/run-multigpu-tests/action.yml New composite action bootstrapping CUDA env, discovering tests, verifying GPU count, running tests (torchrun for static, plain Python for dynamic), and optionally building/uploading the impact manifest.
test/ci_tests/test_multigpu_ci.py 787-line test suite covering gate logic, manifest validation, AST discovery, JUnit parsing, and baseline-diff integration tests. Module-level bare assert for spec loader should be a proper RuntimeError guard.
.github/CACHE_CONTRACT.md Documentation updated to cover the two new multi-GPU impact manifest cache slots, their restore/save semantics, and new concurrency and runner-inventory notes.
.github/workflows/merge-queue-blossom-passthrough.yml Adds multigpu-dynamic and multigpu-static to the merge-queue passthrough commit-status list.
test/coverage.multigpu.rc New coverage config for multi-GPU runs: parallel mode, subprocess patching, relative files, branch coverage, includes scoped to physicsnemo/ and test/.

Reviews (1): Last reviewed commit: "Draft multigpu ci for github from codex" | Re-trigger Greptile

Comment on lines +1278 to +1279
files_payload=_load_optional_json(args.files_json),
manifests=manifests,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Dead code after exec_tests call

exec_tests unconditionally replaces the current process via os.execv, so return 0 on the line that follows is never executed. If exec_tests is ever refactored to not exec (e.g., for testability), the caller would silently return success regardless of the test outcome. The function signature and docstring both declare the no-return contract, so the unreachable line should be removed to avoid misleading future readers.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +17 to +21
"""Tests for the multi-GPU CI selection utility."""

from __future__ import annotations

import importlib.util

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Module-level assert gives an opaque failure on script-not-found

If SCRIPT doesn't exist or the spec loader is None, the bare assert raises a bare AssertionError with no diagnostic text. When Python is run with -O or -OO, assert statements are compiled away entirely, turning a missing-script condition into an AttributeError on the None loader later. Using a conditional raise RuntimeError(...) guard would produce clear, optimisation-safe error messages at import time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant