Concurrency hardening for BfTree parallel workloads by JordanMaples · Pull Request #1158 · microsoft/DiskANN

JordanMaples · 2026-06-12T21:09:27Z

Problem

NeighborProvider::append_vector performs a read-modify-write cycle (read existing neighbors → append → write back) without synchronization. Under concurrent mutation, the last writer wins and silently drops the other's edges. Stress testing shows 11-51% edge loss under 8-thread contention on a single vertex.
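The lost update described above can be replayed deterministically on one thread. This is a minimal sketch, not the crate's code: `Graph`, `read_neighbors`, `write_neighbors`, and `replay_lost_update` are hypothetical names, and a plain `HashMap` stands in for the BfTree-backed neighbor store.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the neighbor store: vertex ID -> adjacency list.
pub type Graph = HashMap<u32, Vec<u32>>;

pub fn read_neighbors(graph: &Graph, vertex: u32) -> Vec<u32> {
    graph.get(&vertex).cloned().unwrap_or_default()
}

pub fn write_neighbors(graph: &mut Graph, vertex: u32, neighbors: Vec<u32>) {
    graph.insert(vertex, neighbors);
}

/// Replays the problematic interleaving: both writers read the old list
/// before either one writes its updated copy back.
pub fn replay_lost_update() -> Vec<u32> {
    let mut graph = Graph::new();
    write_neighbors(&mut graph, 0, vec![1]);

    // Writer A and writer B each take a snapshot of vertex 0.
    let mut seen_by_a = read_neighbors(&graph, 0);
    let mut seen_by_b = read_neighbors(&graph, 0);

    // Each appends its own edge to its private copy.
    seen_by_a.push(2);
    seen_by_b.push(3);

    // A writes back first, then B overwrites A's result.
    write_neighbors(&mut graph, 0, seen_by_a);
    write_neighbors(&mut graph, 0, seen_by_b);

    read_neighbors(&graph, 0)
}
```

In this replay the final list contains edges 1 and 3; edge 2, appended by the first writer, is gone without any error being reported.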

Additionally, the dual-store SetElement wrote the quantized vector before the full-precision one. If the full-precision write failed after the quant write succeeded, the index was left with a quantized ghost that had no backing data. Delete also did not clean up the quant store entry.

Solution

Striped RwLock on NeighborProvider

16,384 stripes with Fibonacci multiply-shift hashing (⌊2⁶⁴/φ⌋)
Constant ~128 KB memory regardless of dataset size — suitable for billion-scale workloads
RwLock allows concurrent readers (search) while only writers serialize
Read lock on get_neighbors, write lock on set_neighbors/append_vector/delete_vector
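The striping scheme in the list above could look roughly like the following. This is a sketch under assumptions: it uses `std::sync::RwLock`, the struct layout and method names (`stripe_index`, `read`, `write`) are illustrative rather than taken from the PR, and it requires a power-of-two stripe count so that the multiply-shift hash maps directly onto the table.

```rust
use std::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};

/// floor(2^64 / phi), the usual 64-bit Fibonacci hashing multiplier.
const FIB_MULTIPLIER: u64 = 0x9E37_79B9_7F4A_7C15;

/// Hypothetical fixed-size lock table shared by all vertices.
pub struct StripedLocks {
    stripes: Vec<RwLock<()>>,
    shift: u32,
}

impl StripedLocks {
    /// `stripe_count` must be a power of two greater than one.
    pub fn new(stripe_count: usize) -> Self {
        assert!(stripe_count.is_power_of_two() && stripe_count > 1);
        let bits = stripe_count.trailing_zeros();
        Self {
            stripes: (0..stripe_count).map(|_| RwLock::new(())).collect(),
            shift: 64 - bits,
        }
    }

    /// Multiply-shift: keep the top `bits` bits of the wrapping product.
    pub fn stripe_index(&self, vertex_id: usize) -> usize {
        ((vertex_id as u64).wrapping_mul(FIB_MULTIPLIER) >> self.shift) as usize
    }

    /// Shared access, intended for readers such as search.
    pub fn read(&self, vertex_id: usize) -> RwLockReadGuard<'_, ()> {
        self.stripes[self.stripe_index(vertex_id)].read().unwrap()
    }

    /// Exclusive access, intended for set/append/delete on a vertex.
    pub fn write(&self, vertex_id: usize) -> RwLockWriteGuard<'_, ()> {
        self.stripes[self.stripe_index(vertex_id)].write().unwrap()
    }
}
```

A mutation would hold the guard returned by `write(vertex_id)` across its whole read-modify-write cycle, so two writers that hash to the same stripe cannot interleave.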

Why not per-vertex locks?

BfTree's value proposition is spilling to disk for datasets exceeding RAM. At 1B vectors, per-vertex RwLock<()> would consume 8 GB of in-memory locks — defeating the purpose. Striped locks keep memory constant at the cost of rare false contention (0.006% collision rate at 16K stripes).
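The figures in this comparison can be reproduced with a few lines of arithmetic. The sketch below assumes 8 bytes per lock, which is the size implied by 128 KB over 16,384 stripes; the actual size of a lock depends on the lock type and platform. The function names are hypothetical.

```rust
/// Assumed size of one lock in bytes (implied by 128 KB / 16,384 stripes).
/// The real size depends on the lock implementation and the platform.
const ASSUMED_LOCK_BYTES: u64 = 8;

/// Memory needed for a table of `lock_count` locks, in bytes.
pub fn lock_table_bytes(lock_count: u64) -> u64 {
    lock_count * ASSUMED_LOCK_BYTES
}

/// Probability that two distinct, uniformly hashed vertex IDs share a stripe.
pub fn pairwise_collision_rate(stripe_count: u64) -> f64 {
    1.0 / stripe_count as f64
}
```

With these assumptions, 16,384 stripes cost 131,072 bytes (128 KiB), one lock per vertex at 1B vectors costs 8,000,000,000 bytes, and the pairwise collision rate at 16,384 stripes is 1/16,384, or about 0.006%.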

Dual-store write ordering

Full-precision is now written first in SetElement
Failure mode is benign: vector reachable in full-precision but not yet in quantized search
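A minimal sketch of the ordering described above, assuming a hypothetical `VectorStore` trait and an in-memory `MemStore` that can be told to fail. None of these names come from the crate; they only illustrate why writing the authoritative store first leaves a benign state when the second write fails.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct StoreError(pub &'static str);

/// Hypothetical minimal store interface; not the crate's real trait.
pub trait VectorStore {
    fn put(&mut self, id: u32, vector: &[f32]) -> Result<(), StoreError>;
    fn contains(&self, id: u32) -> bool;
}

/// In-memory store that can be configured to fail every write.
pub struct MemStore {
    data: HashMap<u32, Vec<f32>>,
    fail_writes: bool,
}

impl MemStore {
    pub fn new(fail_writes: bool) -> Self {
        Self { data: HashMap::new(), fail_writes }
    }
}

impl VectorStore for MemStore {
    fn put(&mut self, id: u32, vector: &[f32]) -> Result<(), StoreError> {
        if self.fail_writes {
            return Err(StoreError("write failed"));
        }
        self.data.insert(id, vector.to_vec());
        Ok(())
    }

    fn contains(&self, id: u32) -> bool {
        self.data.contains_key(&id)
    }
}

/// Full-precision store plus quantized store, written in that order.
pub struct DualStore<F, Q> {
    pub full: F,
    pub quant: Q,
}

impl<F: VectorStore, Q: VectorStore> DualStore<F, Q> {
    pub fn set_element(&mut self, id: u32, vector: &[f32]) -> Result<(), StoreError> {
        // Authoritative write first: if it fails, nothing has been stored.
        self.full.put(id, vector)?;
        // If this fails, the vector is still backed by full-precision data.
        self.quant.put(id, vector)
    }
}
```

If the full-precision write fails, neither store holds the vector. If only the quant write fails, the vector exists in the full-precision store and is merely absent from quantized search.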

Quant store cleanup on delete

DeleteQuant trait with real impl for QuantVectorProvider, no-op for NoStore
Single generic Delete impl; neighbor topology cleanup remains in the upper DiskANNIndex layer
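The trait-based cleanup could be shaped like the sketch below. `DeleteQuant` and `NoStore` are names from the PR description, but the method signature, the `QuantStore` type, and the `Provider` struct are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Trait name from the PR; this method signature is an assumption.
pub trait DeleteQuant {
    fn delete_quant(&mut self, id: u32);
}

/// Placeholder for providers without a quantized store: nothing to delete.
pub struct NoStore;

impl DeleteQuant for NoStore {
    fn delete_quant(&mut self, _id: u32) {}
}

/// Hypothetical quantized store keyed by vertex ID.
pub struct QuantStore {
    codes: HashMap<u32, Vec<u8>>,
}

impl QuantStore {
    pub fn new() -> Self {
        Self { codes: HashMap::new() }
    }

    pub fn insert(&mut self, id: u32, code: Vec<u8>) {
        self.codes.insert(id, code);
    }

    pub fn contains(&self, id: u32) -> bool {
        self.codes.contains_key(&id)
    }
}

impl DeleteQuant for QuantStore {
    fn delete_quant(&mut self, id: u32) {
        self.codes.remove(&id);
    }
}

/// Hypothetical provider: one generic delete covers both configurations.
pub struct Provider<Q> {
    pub full: HashMap<u32, Vec<f32>>,
    pub quant: Q,
}

impl<Q: DeleteQuant> Provider<Q> {
    pub fn delete(&mut self, id: u32) {
        self.full.remove(&id);
        // Real cleanup for QuantStore, a no-op for NoStore.
        self.quant.delete_quant(id);
    }
}
```

Because the bound is on the trait, a single `delete` implementation serves both the quantized and the non-quantized provider without specialization.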

Testing

3 new concurrent stress tests:

test_concurrent_append_no_lost_edges: 8 threads × 10 edges to same vertex
test_concurrent_append_independent_vertices: 8 threads to different vertices
test_concurrent_read_write_consistency: mixed readers + writers, validates no torn reads
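The shape of the first stress test (8 threads × 10 edges to the same vertex) can be sketched as follows. This is not the PR's test code: `Neighbors` and `run_concurrent_appends` are hypothetical, and a single `Mutex` around a `HashMap` stands in for the provider's write lock and the BfTree-backed store.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical neighbor store; one mutex stands in for the write lock.
#[derive(Default)]
pub struct Neighbors {
    lists: Mutex<HashMap<u32, Vec<u32>>>,
}

impl Neighbors {
    /// The whole read-modify-write cycle runs while the lock is held.
    pub fn append(&self, vertex: u32, edge: u32) {
        let mut lists = self.lists.lock().unwrap();
        lists.entry(vertex).or_default().push(edge);
    }

    pub fn get(&self, vertex: u32) -> Vec<u32> {
        self.lists.lock().unwrap().get(&vertex).cloned().unwrap_or_default()
    }
}

/// Spawns `threads` writers that each append `edges_per_thread` distinct
/// edges to vertex 0, then returns the sorted adjacency list.
pub fn run_concurrent_appends(threads: u32, edges_per_thread: u32) -> Vec<u32> {
    let store = Arc::new(Neighbors::default());
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                for e in 0..edges_per_thread {
                    store.append(0, t * edges_per_thread + e);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let mut edges = store.get(0);
    edges.sort_unstable();
    edges
}
```

With the lock held across each append, `run_concurrent_appends(8, 10)` should return all 80 distinct edges; a test of this kind fails when any edge is lost.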

Copilot

Pull request overview

This PR hardens diskann-bftree for concurrent workloads by synchronizing neighbor-list mutations (preventing lost updates), correcting dual-store (full + quant) lifecycle behavior, and adding concurrent stress tests to validate the new behavior.

Changes:

Add striped RwLock synchronization to NeighborProvider, with the stripe selected by vertex ID (read locks for reads, write locks for mutations), and introduce concurrent stress tests.
Adjust hard-delete behavior to also delete quantized vectors when present via a small abstraction over the quant store.
Reorder dual-store writes in set_element to write full-precision first, then quantized.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
diskann-bftree/src/neighbors.rs	Adds striped locking around neighbor-list read/modify/write and adds concurrent stress tests.
diskann-bftree/src/provider.rs	Adds quant-store cleanup on delete via `DeleteQuant` and reorders full/quant write sequencing in `set_element`.
diskann-bftree/src/quant.rs	Adds quant-store delete support used by hard-delete cleanup.

harsha-simhadri · 2026-06-13T01:06:49Z

Jordan, what is the perf impact of this change? Is ingestion speed the same as before? thanks

Add striped RwLock to NeighborProvider to eliminate the TOCTOU race in append_vector's read-modify-write cycle. Under 8-thread contention on the same vertex, the unprotected path loses 11-51% of edges.

Striped locks (16384 stripes, Fibonacci multiply-shift hash):
- Constant ~128 KB memory regardless of dataset size (vs ~8 GB at 1B vectors for per-vertex locks)
- RwLock allows concurrent readers during search; only writers serialize
- Read lock on get_neighbors, write lock on set/append/delete
- Internal get_neighbors_unlocked avoids deadlock in append_vector

Dual-store write ordering (SetElement for QuantVectorProvider):
- Write full-precision (authoritative) before quant
- If quant write fails, worst case is a vector missing from quantized search but still reachable in full-precision (benign)

Hard-delete cleanup (Delete trait):
- Added DeleteQuant trait; QuantVectorProvider deletes its entry, NoStore is a no-op
- Single generic Delete impl with Q: DeleteQuant bound
- Neighbor adjacency cleanup handled by upper DiskANNIndex layer

Tests:
- 3 concurrent stress tests proving the TOCTOU fix
- Verified 10/10 failures without locks, 10/10 passes with locks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

codecov-commenter · 2026-06-15T15:58:55Z

Codecov Report

❌ Patch coverage is 96.91630% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.48%. Comparing base (1bdf1a1) to head (566d5ee).
⚠️ Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
diskann-bftree/src/neighbors.rs	97.72%	4 Missing ⚠️
diskann-bftree/src/provider.rs	88.46%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1158      +/-   ##
==========================================
+ Coverage   89.46%   89.48%   +0.01%     
==========================================
  Files         487      488       +1     
  Lines       92102    92378     +276     
==========================================
+ Hits        82401    82664     +263     
- Misses       9701     9714      +13

Flag	Coverage Δ
miri	`89.48% <96.91%> (+0.01%)`	⬆️
unittests	`89.13% <96.91%> (+0.01%)`	⬆️

Files with missing lines	Coverage Δ
diskann-bftree/src/lib.rs	`47.36% <ø> (ø)`
diskann-bftree/src/locks.rs	`100.00% <100.00%> (ø)`
diskann-bftree/src/quant.rs	`89.69% <100.00%> (+0.17%)`	⬆️
diskann-bftree/src/provider.rs	`90.85% <88.46%> (-0.12%)`	⬇️
diskann-bftree/src/neighbors.rs	`95.33% <97.72%> (+1.38%)`	⬆️

... and 74 files with indirect coverage changes

Move StripedLocks from NeighborProvider to BfTreeProvider so that delete and set_element operations are synchronized under the same lock table as neighbor mutations. This eliminates a potential race between concurrent delete and set_element on the same vertex ID.

- Extract StripedLocks into its own module (locks.rs)
- BfTreeProvider owns Arc<StripedLocks>, shared with NeighborAccessor
- Dynamic stripe count based on available_parallelism (4x cores, min 64)
- delete() and set_element() now acquire write locks per vertex

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
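The dynamic stripe count mentioned in this commit (4x cores, min 64) could be computed along these lines. This is a sketch only: `stripe_count_for` and `default_stripe_count` are hypothetical names, and the fallback to one core when the parallelism query fails is an assumption.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Four stripes per core, but never fewer than 64.
pub fn stripe_count_for(cores: usize) -> usize {
    (cores * 4).max(64)
}

/// Uses the parallelism reported by the standard library; falls back to a
/// single core if the query fails (the fallback is an assumption).
pub fn default_stripe_count() -> usize {
    let cores = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    stripe_count_for(cores)
}
```

If the lock table relies on a power-of-two size for its hash, the result would also need rounding, for example with `usize::next_power_of_two`; whether the PR does this is not stated here.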

JordanMaples · 2026-06-16T20:24:37Z

@copilot give it another review.

Copilot · 2026-06-16T20:38:17Z

@copilot give it another review.

I re-reviewed and found one remaining concurrency issue: beam expansion paths were reading neighbors without taking the striped read lock. I fixed both call sites in commit 562c341 by acquiring self.provider.locks.read(n.into_usize()) before get_neighbors.

Validation run:

cargo test -p diskann-bftree --profile ci ✅
cargo clippy -p diskann-bftree --all-targets --config 'build.rustflags=["-Dwarnings"]' -- -D warnings ✅
Copilot code review ✅ (no additional comments)
CodeQL scan timed out in the validation tool (no rerun per tool guidance).

JordanMaples · 2026-06-17T18:12:21Z

@harsha-simhadri after chatting a bit with Mark and changing the approach to locking in the provider, I ran a few more wiki100k streaming benchmarks on my devbox instance. It looks like the perf overhead is negligible.

JordanMaples requested review from a team and Copilot June 12, 2026 21:09

Copilot started reviewing on behalf of JordanMaples June 12, 2026 21:09

Copilot AI reviewed Jun 12, 2026

Comment thread diskann-bftree/src/provider.rs

Comment thread diskann-bftree/src/neighbors.rs Outdated

Comment thread diskann-bftree/src/provider.rs

Comment thread diskann-bftree/src/provider.rs

JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from aea01c1 to 04a1dd8 Compare June 15, 2026 15:34

JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from 04a1dd8 to c94cdcc Compare June 15, 2026 15:44

hildebrandmw reviewed Jun 16, 2026

Comment thread diskann-bftree/src/neighbors.rs Outdated

Comment thread diskann-bftree/src/provider.rs

Copilot started work on behalf of JordanMaples June 16, 2026 20:25

Copilot finished work on behalf of JordanMaples June 16, 2026 20:38

remove locks for reads to rely on bf-tree's internal locks

535d925

JordanMaples force-pushed the jordanmaples/bftree_parallelism branch from 562c341 to 535d925 Compare June 17, 2026 16:23

switching to mutex as we are not explicitly locking for reads

d07886c

hildebrandmw approved these changes Jun 17, 2026

Comment thread diskann-bftree/src/neighbors.rs Outdated

arc to ref

566d5ee

JordanMaples wants to merge 5 commits into main from jordanmaples/bftree_parallelism

6 participants
