Skip to content

Fix multi-threaded CAGRA search OOB#1781

Open
viclafargue wants to merge 3 commits into
NVIDIA:mainfrom
viclafargue:fix-mg-cagra-search-oob
Open

Fix multi-threaded CAGRA search OOB#1781
viclafargue wants to merge 3 commits into
NVIDIA:mainfrom
viclafargue:fix-mg-cagra-search-oob

Conversation

@viclafargue

@viclafargue viclafargue commented Feb 9, 2026

Copy link
Copy Markdown
Contributor

Closes #1802

Fixes OOB that happens during multi-threaded CAGRA search.

========= Invalid __global__ read of size 8 bytes
=========     at void cuvs::neighbors::cagra::detail::single_cta_search::search_kernel<(unsigned int)64, (unsigned int)64, (unsigned int)1, cuvs::neighbors::cagra::detail::dataset_descriptor_base_t<float, unsigned int, float>, unsigned int, cuvs::neighbors::filtering::none_sample_filter>(unsigned long, T4::DISTANCE_T *, unsigned int, const T4*, const T4::DATA_T *, const T4::INDEX_T *, unsigned int, const T5 *, unsigned int, unsigned long, const T4::INDEX_T *, unsigned int, T4::INDEX_T *, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int *, unsigned int, unsigned int, unsigned int, T6)+0x40
=========     by thread (0,0,0) in block (0,15,0)
=========     Access to 0x6290000020 is potentially made before memory is allocated
=========     and is inside the nearest allocation at 0x6290000000 of size 64 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cuLaunchKernel_ptsz [0x3534c4] in libcuda.so.1
=========         Host Frame:  [0x14e7c] in libcudart.so.13
=========         Host Frame: cudaLaunchKernel_ptsz [0x572b7] in libcudart.so.13

@achirkin achirkin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @viclafargue , the solution looks nice. Fingers crossed it indeed solves the bug in question :)

fun(ptr, stream);
// Record an event after initialization so that other streams can establish
// a GPU-side dependency without expensive host synchronization.
RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this call ever synchronizes the device or the context?
In any case I think it could make sense to move the event creation above cudaMallocAsync (which sometimes does synchronize and sometimes doesn't).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this call ever synchronizes the device or the context?

cudaEventCreateWithFlags does not synchronize the device. It's a lightweight host-side allocation that only briefly acquires the CUDA context lock.

In any case I think it could make sense to move the event creation above cudaMallocAsync (which sometimes does synchronize and sometimes doesn't).

Makes sense, I just updated this.

@viclafargue viclafargue self-assigned this Feb 9, 2026
@viclafargue viclafargue added bug Something isn't working non-breaking Introduces a non-breaking change labels Feb 9, 2026
if (std::holds_alternative<init_f>(value)) {
auto& [fun, size] = std::get<init_f>(value);
dev_descriptor_t* ptr = nullptr;
RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the feedback received on my PR (https://github.com/rapidsai/cuvs/pull/1771/changes#diff-52f864438c5274d5d365954ba71d9988bb00f8c415e2b3c081aa532f3a1a8f07R34-R36), one wonders if debugging might be easier if one used RAFT_EXPECTS with a message here instead?

What is the generally accepted convention?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. It looks like the convention is to use RAFT_CUDA_TRY for error handling of CUDA API calls while RAFT_EXPECTS would most often be used for assertion testing. RAFT_CUDA_TRY should point out the file and line where the error happened. But, I guess that RAFT_EXPECTS may indeed come useful when one wants to display a specific error message.

@divyegala divyegala mentioned this pull request Feb 16, 2026
8 tasks
Ludu-nuvai added a commit to Nuvai/cuvs that referenced this pull request Mar 23, 2026
Cherry-picked from upstream PR NVIDIA#1781.
Fixes cudaErrorIllegalAddress under concurrent search by ensuring
per-thread distance buffer indexing stays within bounds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rapids-bot Bot pushed a commit that referenced this pull request May 9, 2026
CUDA 13 binary size reduction from 282 MB to 257 MB (-8.86%).

Benchmark:
<img width="1137" height="776" alt="image" src="https://github.com/user-attachments/assets/81b4f25a-c999-423d-a2c9-3e45566904bf" />

Apply updates from 
CAGRA related PRs:

- [x] #1800
- [ ] #1781
- [x] #1780
- [x] #1771
- [x] #1834
- [x] #1851
- [x] #1831

JIT related PRs:
- [x] #1812

Authors:
  - Divye Gala (https://github.com/divyegala)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #1807
@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Performance

    • Improved GPU synchronization efficiency in nearest neighbor search operations to enhance computational performance.
  • Refactor

    • Optimized GPU event handling mechanisms to reduce synchronization overhead during initialization and data access.

Walkthrough

The change adds CUDA event-based synchronization to dataset_descriptor_host::state to coordinate descriptor initialization across GPU streams. A non-timing event is created during eval(), recorded after initialization completes, and later waited on in get() so callers synchronize GPU-side without blocking the host.

Changes

GPU-side descriptor initialization synchronization

Layer / File(s) Summary
Event state member and lifecycle
cpp/src/neighbors/detail/cagra/compute_distance.hpp
Add cudaEvent_t init_event member to store the initialization completion event, and update the destructor to destroy the event resource when the state is destroyed.
Event creation and recording during initialization
cpp/src/neighbors/detail/cagra/compute_distance.hpp
In eval(), create a non-timing CUDA event before descriptor allocation and record it on the init stream after the allocation/init callback to signal initialization completion to other streams.
Stream synchronization on descriptor access
cpp/src/neighbors/detail/cagra/compute_distance.hpp
In get(), wait on the initialization event using cudaStreamWaitEvent on the caller's stream to establish a GPU-side dependency before returning the descriptor pointer.

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix multi-threaded CAGRA search OOB' directly addresses the main objective of fixing an out-of-bounds access in multi-threaded CAGRA search, which aligns with the changeset's purpose.
Description check ✅ Passed The description references issue #1802 and explains the OOB bug being fixed, which is directly related to the code changes that address memory synchronization issues.
Linked Issues check ✅ Passed The changeset implements GPU memory synchronization via CUDA events to prevent out-of-bounds memory access during multi-threaded CAGRA search, directly addressing the OOB read issue documented in #1802.
Out of Scope Changes check ✅ Passed The modification to dataset_descriptor_host::state is focused solely on fixing the synchronization issue that causes the OOB access, with no unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp`:
- Around line 242-249: The code currently assigns init_event and ptr into outer
state before all subsequent calls complete, so failures in cudaMallocAsync,
fun(ptr, stream), or cudaEventRecord can leak the device pointer and event;
change the routine to keep local variables (e.g., local cudaEvent_t
init_event_local and void* ptr_local) and perform cudaEventCreateWithFlags,
cudaMallocAsync, fun(ptr_local, stream), and cudaEventRecord on those locals,
and only after all succeed assign value = std::make_tuple(ptr_local, stream) and
ready.store(...); on any failure catch/handle by destroying/freeing the local
event and freeing ptr_local (using RAII wrappers or explicit
cudaEventDestroy/cudaFreeAsync calls) before rethrowing so no GPU resources are
leaked.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c015c666-28b0-4923-aa41-e5060846f253

📥 Commits

Reviewing files that changed from the base of the PR and between 63ed6ec and d48e73e.

📒 Files selected for processing (1)
  • cpp/src/neighbors/detail/cagra/compute_distance.hpp

Comment on lines +242 to 249
RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));
RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
fun(ptr, stream);
// Record an event after initialization so that other streams can establish
// a GPU-side dependency without expensive host synchronization.
RAFT_CUDA_TRY(cudaEventRecord(init_event, stream));
value = std::make_tuple(ptr, stream);
ready.store(true, std::memory_order_release);

@coderabbitai coderabbitai Bot May 13, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle partial init failures without leaking event/device memory

If initialization fails after Line 242 (e.g., Line 243, Line 244, or Line 247), value stays in init_f, so the allocated ptr is not reclaimed, and a later retry can overwrite init_event and leak the previous handle. Please keep event/pointer local until success, and clean both in a failure path before rethrow.

Proposed fix
 void eval(rmm::cuda_stream_view stream)
 {
   std::lock_guard<std::mutex> lock(mutex);
   if (std::holds_alternative<init_f>(value)) {
     auto& [fun, size]     = std::get<init_f>(value);
     dev_descriptor_t* ptr = nullptr;
-    RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));
-    RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
-    fun(ptr, stream);
-    // Record an event after initialization so that other streams can establish
-    // a GPU-side dependency without expensive host synchronization.
-    RAFT_CUDA_TRY(cudaEventRecord(init_event, stream));
-    value = std::make_tuple(ptr, stream);
-    ready.store(true, std::memory_order_release);
+    cudaEvent_t local_event{nullptr};
+    try {
+      RAFT_CUDA_TRY(cudaEventCreateWithFlags(&local_event, cudaEventDisableTiming));
+      RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
+      fun(ptr, stream);
+      // Record an event after initialization so that other streams can establish
+      // a GPU-side dependency without expensive host synchronization.
+      RAFT_CUDA_TRY(cudaEventRecord(local_event, stream));
+      init_event = local_event;
+      value      = std::make_tuple(ptr, stream);
+      ready.store(true, std::memory_order_release);
+    } catch (...) {
+      if (ptr != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaFreeAsync(ptr, stream)); }
+      if (local_event != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaEventDestroy(local_event)); }
+      throw;
+    }
   }
 }

As per coding guidelines: "Device memory allocations (cudaMalloc, RMM functions) must have corresponding deallocations; use RAII patterns ... to prevent GPU memory leaks on error paths".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp` around lines 242 - 249,
The code currently assigns init_event and ptr into outer state before all
subsequent calls complete, so failures in cudaMallocAsync, fun(ptr, stream), or
cudaEventRecord can leak the device pointer and event; change the routine to
keep local variables (e.g., local cudaEvent_t init_event_local and void*
ptr_local) and perform cudaEventCreateWithFlags, cudaMallocAsync, fun(ptr_local,
stream), and cudaEventRecord on those locals, and only after all succeed assign
value = std::make_tuple(ptr_local, stream) and ready.store(...); on any failure
catch/handle by destroying/freeing the local event and freeing ptr_local (using
RAII wrappers or explicit cudaEventDestroy/cudaFreeAsync calls) before
rethrowing so no GPU resources are leaked.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This is a good point, IMHO. If the cudaEventRecord line throws, that's the last we'll see of the ptr allocation. :/

@viclafargue, do we have an RAII-wrapper specifically for cases like this one?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@achirkin

Copy link
Copy Markdown
Contributor

@viclafargue that's funny I completely forgot about this PR and that I reviewed it and I arrived at almost the same solution in #2237

@viclafargue

Copy link
Copy Markdown
Contributor Author

@viclafargue that's funny I completely forgot about this PR and that I reviewed it and I arrived at almost the same solution in #2237

Seems like this may be the correct solution then. It looks like yours is already approved, so we can merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] OOB access in multi-threaded CAGRA search

4 participants