Add scoped event-time director models #249
Open
sanjaychari wants to merge 7 commits into
This PR makes the following changes:

- Add scoped event-time model storage and inference so the dragonfly-dally event-time surrogate can use separate ML models for switch LPs while retaining the existing director request flow.
- Fix ZeroMQ director request argument handling so command handlers parse the argument-count prefix exactly once. This prevents client IDs such as 1 from being mistaken for a second argument count and dropped from training/inference requests.
- Restore the ZeroMQ director build path by compiling director-client.C when USE_ZMQML is enabled and propagating the USE_ZMQML compile definition to downstream targets.
- Clean up the director-client merge conflict around global ZMQ latency statistics and keep the cumulative MPI-reduced DIR_STATS output format.
- Expose a latency-recording hook so event-time inference requests from dragonfly-dally are included in the shared ZMQ request statistics.
- Update the event-time workflow to use START_ITER and END_ITER template variables and save/load the scoped event-time model directory rather than a single model file.
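To illustrate the argument-count fix described above, here is a minimal sketch of parsing the count prefix exactly once at the dispatch layer. It is not the code from this PR: the frame layout (command, then argument count, then arguments) and the names `Request`, `parse_request`, `client_id_of`, and `strip_prefix_again` are all hypothetical, chosen only to show how a client ID of 1 can be lost when a handler strips a second "count".

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical request layout (an assumption, not the director's real wire format):
//   frames[0]        command name
//   frames[1]        argument count N
//   frames[2..2+N)   the N arguments, client ID first
struct Request {
    std::string command;
    std::vector<std::string> args;
};

// Strip the argument-count prefix exactly once, before any handler runs.
Request parse_request(const std::vector<std::string>& frames) {
    if (frames.size() < 2) {
        throw std::invalid_argument("request needs a command and an argument count");
    }
    const std::size_t argc = std::stoul(frames[1]);
    if (frames.size() != 2 + argc) {
        throw std::invalid_argument("argument count does not match payload");
    }
    return Request{frames[0],
                   std::vector<std::string>(frames.begin() + 2, frames.end())};
}

// A handler receives already-parsed arguments and never re-reads a count.
std::string client_id_of(const Request& req) {
    if (req.args.empty()) {
        throw std::invalid_argument("missing client ID");
    }
    return req.args.front();
}

// Buggy variant, shown for contrast: it assumes args[0] is still a count
// prefix and discards it, which silently drops the client ID.
std::vector<std::string> strip_prefix_again(const std::vector<std::string>& args) {
    if (args.empty()) {
        return {};
    }
    return std::vector<std::string>(args.begin() + 1, args.end());
}
```

With a request such as `{"infer", "2", "1", "payload"}`, `client_id_of(parse_request(...))` yields the client ID `"1"`, whereas passing the parsed arguments through `strip_prefix_again` leaves only `"payload"`. Keeping the prefix handling in one place means handlers cannot disagree about whether it has already been consumed.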
This commit formats files with clang-format-20, which is used by the CI, instead of just clang-format.