Fix multi-threaded CAGRA search OOB by viclafargue · Pull Request #1781 · NVIDIA/cuvs

viclafargue · 2026-02-09T10:49:27Z

Fixes OOB that happens during multi-threaded CAGRA search.

========= Invalid __global__ read of size 8 bytes
=========     at void cuvs::neighbors::cagra::detail::single_cta_search::search_kernel<(unsigned int)64, (unsigned int)64, (unsigned int)1, cuvs::neighbors::cagra::detail::dataset_descriptor_base_t<float, unsigned int, float>, unsigned int, cuvs::neighbors::filtering::none_sample_filter>(unsigned long, T4::DISTANCE_T *, unsigned int, const T4*, const T4::DATA_T *, const T4::INDEX_T *, unsigned int, const T5 *, unsigned int, unsigned long, const T4::INDEX_T *, unsigned int, T4::INDEX_T *, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int *, unsigned int, unsigned int, unsigned int, T6)+0x40
=========     by thread (0,0,0) in block (0,15,0)
=========     Access to 0x6290000020 is potentially made before memory is allocated
=========     and is inside the nearest allocation at 0x6290000000 of size 64 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame: cuLaunchKernel_ptsz [0x3534c4] in libcuda.so.1
=========         Host Frame:  [0x14e7c] in libcudart.so.13
=========         Host Frame: cudaLaunchKernel_ptsz [0x572b7] in libcudart.so.13

achirkin

Thanks, @viclafargue , the solution looks nice. Fingers crossed it indeed solves the bug in question :)

achirkin · 2026-02-09T11:07:48Z

        fun(ptr, stream);
+        // Record an event after initialization so that other streams can establish
+        // a GPU-side dependency without expensive host synchronization.
+        RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));


Do you know if this call ever synchronizes the device or the context?
In any case I think it could make sense to move the event creation above cudaMallocAsync (which sometimes does synchronize and sometimes doesn't).

Do you know if this call ever synchronizes the device or the context?

cudaEventCreateWithFlags does not synchronize the device. It's a lightweight host-side allocation that only briefly acquires the CUDA context lock.

In any case I think it could make sense to move the event creation above cudaMallocAsync (which sometimes does synchronize and sometimes doesn't).

Makes sense, I just updated this.

mythrocks · 2026-02-11T23:47:05Z

      if (std::holds_alternative<init_f>(value)) {
        auto& [fun, size]     = std::get<init_f>(value);
        dev_descriptor_t* ptr = nullptr;
+        RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));


Based on the feedback received on my PR (https://github.com/rapidsai/cuvs/pull/1771/changes#diff-52f864438c5274d5d365954ba71d9988bb00f8c415e2b3c081aa532f3a1a8f07R34-R36), one wonders if debugging might be easier if one used RAFT_EXPECTS with a message here instead?

What is the generally accepted convention?

Good question. It looks like the convention is to use RAFT_CUDA_TRY for error handling of CUDA API calls while RAFT_EXPECTS would most often be used for assertion testing. RAFT_CUDA_TRY should point out the file and line where the error happened. But, I guess that RAFT_EXPECTS may indeed come useful when one wants to display a specific error message.

Cherry-picked from upstream PR NVIDIA#1781. Fixes cudaErrorIllegalAddress under concurrent search by ensuring per-thread distance buffer indexing stays within bounds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CUDA 13 binary size reduction from 282 MB to 257 MB (-8.86%). Benchmark: <img width="1137" height="776" alt="image" src="https://github.com/user-attachments/assets/81b4f25a-c999-423d-a2c9-3e45566904bf" /> Apply updates from CAGRA related PRs: - [x] #1800 - [ ] #1781 - [x] #1780 - [x] #1771 - [x] #1834 - [x] #1851 - [x] #1831 JIT related PRs: - [x] #1812 Authors: - Divye Gala (https://github.com/divyegala) - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Dante Gama Dessavre (https://github.com/dantegd) - Bradley Dice (https://github.com/bdice) Approvers: - Kyle Edwards (https://github.com/KyleFromNVIDIA) - Dante Gama Dessavre (https://github.com/dantegd) URL: #1807

coderabbitai · 2026-05-13T17:54:14Z

📝 Walkthrough

Summary by CodeRabbit

Performance
- Improved GPU synchronization efficiency in nearest neighbor search operations to enhance computational performance.
Refactor
- Optimized GPU event handling mechanisms to reduce synchronization overhead during initialization and data access.

Walkthrough

The change adds CUDA event-based synchronization to dataset_descriptor_host::state to coordinate descriptor initialization across GPU streams. A non-timing event is created during eval(), recorded after initialization completes, and later waited on in get() so callers synchronize GPU-side without blocking the host.

Changes

GPU-side descriptor initialization synchronization

Layer / File(s)	Summary
Event state member and lifecycle `cpp/src/neighbors/detail/cagra/compute_distance.hpp`	Add `cudaEvent_t init_event` member to store the initialization completion event, and update the destructor to destroy the event resource when the state is destroyed.
Event creation and recording during initialization `cpp/src/neighbors/detail/cagra/compute_distance.hpp`	In `eval()`, create a non-timing CUDA event before descriptor allocation and record it on the init stream after the allocation/init callback to signal initialization completion to other streams.
Stream synchronization on descriptor access `cpp/src/neighbors/detail/cagra/compute_distance.hpp`	In `get()`, wait on the initialization event using `cudaStreamWaitEvent` on the caller's stream to establish a GPU-side dependency before returning the descriptor pointer.

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix multi-threaded CAGRA search OOB' directly addresses the main objective of fixing an out-of-bounds access in multi-threaded CAGRA search, which aligns with the changeset's purpose.
Description check	✅ Passed	The description references issue `#1802` and explains the OOB bug being fixed, which is directly related to the code changes that address memory synchronization issues.
Linked Issues check	✅ Passed	The changeset implements GPU memory synchronization via CUDA events to prevent out-of-bounds memory access during multi-threaded CAGRA search, directly addressing the OOB read issue documented in `#1802`.
Out of Scope Changes check	✅ Passed	The modification to dataset_descriptor_host::state is focused solely on fixing the synchronization issue that causes the OOB access, with no unrelated changes detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp`:
- Around line 242-249: The code currently assigns init_event and ptr into outer
state before all subsequent calls complete, so failures in cudaMallocAsync,
fun(ptr, stream), or cudaEventRecord can leak the device pointer and event;
change the routine to keep local variables (e.g., local cudaEvent_t
init_event_local and void* ptr_local) and perform cudaEventCreateWithFlags,
cudaMallocAsync, fun(ptr_local, stream), and cudaEventRecord on those locals,
and only after all succeed assign value = std::make_tuple(ptr_local, stream) and
ready.store(...); on any failure catch/handle by destroying/freeing the local
event and freeing ptr_local (using RAII wrappers or explicit
cudaEventDestroy/cudaFreeAsync calls) before rethrowing so no GPU resources are
leaked.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c015c666-28b0-4923-aa41-e5060846f253

📥 Commits

Reviewing files that changed from the base of the PR and between 63ed6ec and d48e73e.

📒 Files selected for processing (1)

cpp/src/neighbors/detail/cagra/compute_distance.hpp

coderabbitai · 2026-05-13T17:54:17Z

+        RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));
        RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
        fun(ptr, stream);
+        // Record an event after initialization so that other streams can establish
+        // a GPU-side dependency without expensive host synchronization.
+        RAFT_CUDA_TRY(cudaEventRecord(init_event, stream));
        value = std::make_tuple(ptr, stream);
        ready.store(true, std::memory_order_release);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle partial init failures without leaking event/device memory

If initialization fails after Line 242 (e.g., Line 243, Line 244, or Line 247), value stays in init_f, so the allocated ptr is not reclaimed, and a later retry can overwrite init_event and leak the previous handle. Please keep event/pointer local until success, and clean both in a failure path before rethrow.

Proposed fix

void eval(rmm::cuda_stream_view stream) { std::lock_guard<std::mutex> lock(mutex); if (std::holds_alternative<init_f>(value)) { auto& [fun, size] = std::get<init_f>(value); dev_descriptor_t* ptr = nullptr; - RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming)); - RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream)); - fun(ptr, stream); - // Record an event after initialization so that other streams can establish - // a GPU-side dependency without expensive host synchronization. - RAFT_CUDA_TRY(cudaEventRecord(init_event, stream)); - value = std::make_tuple(ptr, stream); - ready.store(true, std::memory_order_release); + cudaEvent_t local_event{nullptr}; + try { + RAFT_CUDA_TRY(cudaEventCreateWithFlags(&local_event, cudaEventDisableTiming)); + RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream)); + fun(ptr, stream); + // Record an event after initialization so that other streams can establish + // a GPU-side dependency without expensive host synchronization. + RAFT_CUDA_TRY(cudaEventRecord(local_event, stream)); + init_event = local_event; + value = std::make_tuple(ptr, stream); + ready.store(true, std::memory_order_release); + } catch (...) { + if (ptr != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaFreeAsync(ptr, stream)); } + if (local_event != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaEventDestroy(local_event)); } + throw; + } } }

As per coding guidelines: "Device memory allocations (cudaMalloc, RMM functions) must have corresponding deallocations; use RAII patterns ... to prevent GPU memory leaks on error paths".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp` around lines 242 - 249, The code currently assigns init_event and ptr into outer state before all subsequent calls complete, so failures in cudaMallocAsync, fun(ptr, stream), or cudaEventRecord can leak the device pointer and event; change the routine to keep local variables (e.g., local cudaEvent_t init_event_local and void* ptr_local) and perform cudaEventCreateWithFlags, cudaMallocAsync, fun(ptr_local, stream), and cudaEventRecord on those locals, and only after all succeed assign value = std::make_tuple(ptr_local, stream) and ready.store(...); on any failure catch/handle by destroying/freeing the local event and freeing ptr_local (using RAII wrappers or explicit cudaEventDestroy/cudaFreeAsync calls) before rethrowing so no GPU resources are leaked.

Hmm. This is a good point, IMHO. If the cudaEventRecord line throws, that's the last we'll see of the ptr allocation. :/

@viclafargue, do we have an RAII-wrapper specifically for cases like this one?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

achirkin · 2026-06-29T09:16:49Z

@viclafargue that's funny I completely forgot about this PR and that I reviewed it and I arrived at almost the same solution in #2237

viclafargue · 2026-06-29T11:05:02Z

@viclafargue that's funny I completely forgot about this PR and that I reviewed it and I arrived at almost the same solution in #2237

Seems like this may be the correct solution then. It looks like yours is already approved, so we can merge it.

Fix MG CAGRA search OOB

9c648fa

viclafargue requested a review from a team as a code owner February 9, 2026 10:49

github-project-automation Bot added this to Unstructured Data Processing Feb 9, 2026

achirkin reviewed Feb 9, 2026

View reviewed changes

viclafargue moved this to In Progress in Unstructured Data Processing Feb 9, 2026

viclafargue self-assigned this Feb 9, 2026

viclafargue added bug Something isn't working non-breaking Introduces a non-breaking change labels Feb 9, 2026

adressing review

49a24f4

mythrocks reviewed Feb 11, 2026

View reviewed changes

viclafargue requested review from achirkin and mythrocks February 13, 2026 16:38

divyegala mentioned this pull request Feb 16, 2026

JIT LTO Cagra Search #1807

Merged

8 tasks

mohanprasand-nuvai mentioned this pull request Mar 23, 2026

feat: Rust API safety, serialization, filtering, and upstream bug fixes Nuvai/cuvs#1

Merged

7 tasks

Merge branch 'main' into fix-mg-cagra-search-oob

d48e73e

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Uh oh!

Conversation

viclafargue commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

achirkin left a comment

Choose a reason for hiding this comment

Uh oh!

achirkin Feb 9, 2026

Choose a reason for hiding this comment

Uh oh!

viclafargue Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

mythrocks Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

viclafargue Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 13, 2026

Summary by CodeRabbit

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mythrocks May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

achirkin commented Jun 29, 2026

Uh oh!

viclafargue commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

viclafargue commented Feb 9, 2026 •

edited

Loading

coderabbitai Bot May 13, 2026 •

edited

Loading