Fix multi-threaded CAGRA search OOB#1781
Conversation
achirkin
left a comment
There was a problem hiding this comment.
Thanks, @viclafargue , the solution looks nice. Fingers crossed it indeed solves the bug in question :)
| fun(ptr, stream); | ||
| // Record an event after initialization so that other streams can establish | ||
| // a GPU-side dependency without expensive host synchronization. | ||
| RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming)); |
There was a problem hiding this comment.
Do you know if this call ever synchronizes the device or the context?
In any case I think it could make sense to move the event creation above cudaMallocAsync (which sometimes does synchronize and sometimes doesn't).
There was a problem hiding this comment.
Do you know if this call ever synchronizes the device or the context?
cudaEventCreateWithFlags does not synchronize the device. It's a lightweight host-side allocation that only briefly acquires the CUDA context lock.
In any case I think it could make sense to move the event creation above cudaMallocAsync (which sometimes does synchronize and sometimes doesn't).
Makes sense, I just updated this.
| if (std::holds_alternative<init_f>(value)) { | ||
| auto& [fun, size] = std::get<init_f>(value); | ||
| dev_descriptor_t* ptr = nullptr; | ||
| RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming)); |
There was a problem hiding this comment.
Based on the feedback received on my PR (https://github.com/rapidsai/cuvs/pull/1771/changes#diff-52f864438c5274d5d365954ba71d9988bb00f8c415e2b3c081aa532f3a1a8f07R34-R36), one wonders if debugging might be easier if one used RAFT_EXPECTS with a message here instead?
What is the generally accepted convention?
There was a problem hiding this comment.
Good question. It looks like the convention is to use RAFT_CUDA_TRY for error handling of CUDA API calls while RAFT_EXPECTS would most often be used for assertion testing. RAFT_CUDA_TRY should point out the file and line where the error happened. But, I guess that RAFT_EXPECTS may indeed come useful when one wants to display a specific error message.
Cherry-picked from upstream PR NVIDIA#1781. Fixes cudaErrorIllegalAddress under concurrent search by ensuring per-thread distance buffer indexing stays within bounds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CUDA 13 binary size reduction from 282 MB to 257 MB (-8.86%). Benchmark: <img width="1137" height="776" alt="image" src="https://github.com/user-attachments/assets/81b4f25a-c999-423d-a2c9-3e45566904bf" /> Apply updates from CAGRA related PRs: - [x] #1800 - [ ] #1781 - [x] #1780 - [x] #1771 - [x] #1834 - [x] #1851 - [x] #1831 JIT related PRs: - [x] #1812 Authors: - Divye Gala (https://github.com/divyegala) - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Dante Gama Dessavre (https://github.com/dantegd) - Bradley Dice (https://github.com/bdice) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Dante Gama Dessavre (https://github.com/dantegd) URL: #1807
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe change adds CUDA event-based synchronization to ChangesGPU-side descriptor initialization synchronization
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp`:
- Around line 242-249: The code currently assigns init_event and ptr into outer
state before all subsequent calls complete, so failures in cudaMallocAsync,
fun(ptr, stream), or cudaEventRecord can leak the device pointer and event;
change the routine to keep local variables (e.g., local cudaEvent_t
init_event_local and void* ptr_local) and perform cudaEventCreateWithFlags,
cudaMallocAsync, fun(ptr_local, stream), and cudaEventRecord on those locals,
and only after all succeed assign value = std::make_tuple(ptr_local, stream) and
ready.store(...); on any failure catch/handle by destroying/freeing the local
event and freeing ptr_local (using RAII wrappers or explicit
cudaEventDestroy/cudaFreeAsync calls) before rethrowing so no GPU resources are
leaked.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c015c666-28b0-4923-aa41-e5060846f253
📒 Files selected for processing (1)
cpp/src/neighbors/detail/cagra/compute_distance.hpp
| RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming)); | ||
| RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream)); | ||
| fun(ptr, stream); | ||
| // Record an event after initialization so that other streams can establish | ||
| // a GPU-side dependency without expensive host synchronization. | ||
| RAFT_CUDA_TRY(cudaEventRecord(init_event, stream)); | ||
| value = std::make_tuple(ptr, stream); | ||
| ready.store(true, std::memory_order_release); |
There was a problem hiding this comment.
Handle partial init failures without leaking event/device memory
If initialization fails after Line 242 (e.g., Line 243, Line 244, or Line 247), value stays in init_f, so the allocated ptr is not reclaimed, and a later retry can overwrite init_event and leak the previous handle. Please keep event/pointer local until success, and clean both in a failure path before rethrow.
Proposed fix
void eval(rmm::cuda_stream_view stream)
{
std::lock_guard<std::mutex> lock(mutex);
if (std::holds_alternative<init_f>(value)) {
auto& [fun, size] = std::get<init_f>(value);
dev_descriptor_t* ptr = nullptr;
- RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));
- RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
- fun(ptr, stream);
- // Record an event after initialization so that other streams can establish
- // a GPU-side dependency without expensive host synchronization.
- RAFT_CUDA_TRY(cudaEventRecord(init_event, stream));
- value = std::make_tuple(ptr, stream);
- ready.store(true, std::memory_order_release);
+ cudaEvent_t local_event{nullptr};
+ try {
+ RAFT_CUDA_TRY(cudaEventCreateWithFlags(&local_event, cudaEventDisableTiming));
+ RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
+ fun(ptr, stream);
+ // Record an event after initialization so that other streams can establish
+ // a GPU-side dependency without expensive host synchronization.
+ RAFT_CUDA_TRY(cudaEventRecord(local_event, stream));
+ init_event = local_event;
+ value = std::make_tuple(ptr, stream);
+ ready.store(true, std::memory_order_release);
+ } catch (...) {
+ if (ptr != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaFreeAsync(ptr, stream)); }
+ if (local_event != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaEventDestroy(local_event)); }
+ throw;
+ }
}
}As per coding guidelines: "Device memory allocations (cudaMalloc, RMM functions) must have corresponding deallocations; use RAII patterns ... to prevent GPU memory leaks on error paths".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp` around lines 242 - 249,
The code currently assigns init_event and ptr into outer state before all
subsequent calls complete, so failures in cudaMallocAsync, fun(ptr, stream), or
cudaEventRecord can leak the device pointer and event; change the routine to
keep local variables (e.g., local cudaEvent_t init_event_local and void*
ptr_local) and perform cudaEventCreateWithFlags, cudaMallocAsync, fun(ptr_local,
stream), and cudaEventRecord on those locals, and only after all succeed assign
value = std::make_tuple(ptr_local, stream) and ready.store(...); on any failure
catch/handle by destroying/freeing the local event and freeing ptr_local (using
RAII wrappers or explicit cudaEventDestroy/cudaFreeAsync calls) before
rethrowing so no GPU resources are leaked.
There was a problem hiding this comment.
Hmm. This is a good point, IMHO. If the cudaEventRecord line throws, that's the last we'll see of the ptr allocation. :/
@viclafargue, do we have an RAII-wrapper specifically for cases like this one?
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
|
@viclafargue that's funny I completely forgot about this PR and that I reviewed it and I arrived at almost the same solution in #2237 |
Seems like this may be the correct solution then. It looks like yours is already approved, so we can merge it. |
Closes #1802
Fixes OOB that happens during multi-threaded CAGRA search.