Draft multigpu ci for github from codex by coreyjadams · Pull Request #1790 · NVIDIA/physicsnemo

coreyjadams · 2026-07-03T19:16:54Z

PhysicsNeMo Pull Request

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
The CHANGELOG.md is up to date with these changes.
An issue is linked to this pull request.
If I am implementing a new model or modifying any existing model, I have followed the Models Implementation Coding Standards.

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

copy-pr-bot · 2026-07-03T19:16:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coreyjadams · 2026-07-03T19:20:36Z

/ok to test 0341d7e

greptile-apps · 2026-07-03T19:27:44Z

Greptile Summary

This PR adds a complete multi-GPU CI system for GitHub Actions, including a Python script (multigpu_ci.py) for impact-based test selection, a composite action for environment setup and test execution, a workflow with separate dynamic and static GPU streams, impact-manifest publisher jobs, and a merge-queue passthrough entry for the two new status checks.

Impact-gating logic in multigpu_ci.py compares PR file changes against nightly-cached manifests (coverage + JUnit data) and fails open on any uncertainty — missing manifests, stale data, API errors, or PR snapshot races all run the affected stream rather than silently skipping it.
Manifest validation is strict: schema version, stream, completeness flag, source SHA, timestamp, rank/runner-profile match, JUnit totals, and the coverage-to-test-file union must all pass before a manifest is accepted.
CACHE_CONTRACT.md is updated to document the two new multigpu-impact-{stream}-v1-latest mutable cache slots and their save/restore semantics.

Important Files Changed

Filename	Overview
.github/scripts/multigpu_ci.py	New 1,367-line Python script for multi-GPU CI utilities: AST-based marker discovery, manifest building/validation, PR impact gating, and GPU verification. Well-structured with fail-open semantics. One dead-code line after os.execv in _command_run_tests.
.github/workflows/github-multigpu.yml	New workflow with CPU selector job followed by dynamic and static GPU jobs; publisher jobs run only on schedule/approved default-branch dispatches. Concurrency groups, fail-open conditions, and permission scopes look correct.
.github/actions/run-multigpu-tests/action.yml	New composite action bootstrapping CUDA env, discovering tests, verifying GPU count, running tests (torchrun for static, plain Python for dynamic), and optionally building/uploading the impact manifest.
test/ci_tests/test_multigpu_ci.py	787-line test suite covering gate logic, manifest validation, AST discovery, JUnit parsing, and baseline-diff integration tests. Module-level bare assert for spec loader should be a proper RuntimeError guard.
.github/CACHE_CONTRACT.md	Documentation updated to cover the two new multi-GPU impact manifest cache slots, their restore/save semantics, and new concurrency and runner-inventory notes.
.github/workflows/merge-queue-blossom-passthrough.yml	Adds multigpu-dynamic and multigpu-static to the merge-queue passthrough commit-status list.
test/coverage.multigpu.rc	New coverage config for multi-GPU runs: parallel mode, subprocess patching, relative files, branch coverage, includes scoped to physicsnemo/ and test/.

_{Reviews (1): Last reviewed commit: "Draft multigpu ci for github from codex" | Re-trigger Greptile}

greptile-apps · 2026-07-03T19:27:48Z

+        files_payload=_load_optional_json(args.files_json),
+        manifests=manifests,


Dead code after exec_tests call

exec_tests unconditionally replaces the current process via os.execv, so return 0 on the line that follows is never executed. If exec_tests is ever refactored to not exec (e.g., for testability), the caller would silently return success regardless of the test outcome. The function signature and docstring both declare the no-return contract, so the unreachable line should be removed to avoid misleading future readers.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-07-03T19:27:49Z

+"""Tests for the multi-GPU CI selection utility."""
+
+from __future__ import annotations
+
+import importlib.util


Module-level assert gives an opaque failure on script-not-found

If SCRIPT doesn't exist or the spec loader is None, the bare assert raises a bare AssertionError with no diagnostic text. When Python is run with -O or -OO, assert statements are compiled away entirely, turning a missing-script condition into an AttributeError on the None loader later. Using a conditional raise RuntimeError(...) guard would produce clear, optimisation-safe error messages at import time.

Draft multigpu ci for github from codex

0341d7e

coreyjadams requested review from NickGeneva and ktangsali as code owners July 3, 2026 19:16

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Draft multigpu ci for github from codex#1790

Draft multigpu ci for github from codex#1790
coreyjadams wants to merge 1 commit into
mainfrom
multigpu-ci

coreyjadams commented Jul 3, 2026

Uh oh!

copy-pr-bot Bot commented Jul 3, 2026

Uh oh!

coreyjadams commented Jul 3, 2026

Uh oh!

greptile-apps Bot commented Jul 3, 2026

Uh oh!

greptile-apps Bot Jul 3, 2026

Uh oh!

greptile-apps Bot Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		files_payload=_load_optional_json(args.files_json),
		manifests=manifests,

Uh oh!

Conversation

coreyjadams commented Jul 3, 2026

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

Uh oh!

copy-pr-bot Bot commented Jul 3, 2026

Uh oh!

coreyjadams commented Jul 3, 2026

Uh oh!

greptile-apps Bot commented Jul 3, 2026

Greptile Summary

Important Files Changed

Uh oh!

greptile-apps Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant