Skip to content

fix(concurrency): guard 20+ race conditions across getOrCreate, balance updates, and view permissions#2845

Merged
simonredfern merged 31 commits into
OpenBankProject:developfrom
hongwei1:feature/concurrency-hazard-tests
Jun 24, 2026
Merged

fix(concurrency): guard 20+ race conditions across getOrCreate, balance updates, and view permissions#2845
simonredfern merged 31 commits into
OpenBankProject:developfrom
hongwei1:feature/concurrency-hazard-tests

Conversation

@hongwei1

@hongwei1 hongwei1 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

This branch fixes 20+ concurrency race conditions identified through systematic hazard analysis and concurrency test suite execution. All 20 scenarios in CONCURRENCY_HAZARDS.md are now in the Gracefully handled tier.

Core fixes

  • Atomic balance updates — replace Lift Mapper read-modify-write with Doobie SELECT FOR UPDATE row-level lock (DoobieBankAccountQueries)
  • MFA double-spend preventionlockTransactionRequest now uses runUpdate (Strategy.default) instead of runQuery (Strategy.void) so the FOR UPDATE lock is held through the transaction on the fallback path
  • getOrCreate races — wrap all saveMe() in tryo + re-fetch on Failure across 9 providers: OAuth Consumer, SystemView, ViewPermission, AccountHolder, Entitlement, mapping providers (Customer / Transaction / Account), and BadLoginStatus
  • Scheduler stale-save — consent scheduler uses conditional UPDATE WHERE mStatus = guardStatus; removed direct mStatus(...).saveMe() that could resurrect revoked consents
  • View permission races — guard addPermission, addPermissions, revokePermission, and migrateViewPermissions against duplicate inserts; fix removeAllViewsAndVierPermissions to explicitly delete AccountAccess before bulkDelete_!! (which bypasses beforeDelete hooks)
  • Future counter atomicityincrementFutureCounter / decrementFutureCounter use UPDATE ... SET n = n ± 1 instead of read-increment-write

Post-review correctness fixes (5 additional)

  • lockTransactionRequest: runQueryrunUpdate (critical: lock was released before payment logic ran on non-request-scoped paths)
  • OAuth getOrCreateConsumer re-fetch: use (azp, sub) match; fall back to key when either is absent to avoid matching unrelated consumers via empty-string
  • removeAllViewsAndVierPermissions: explicit AccountAccess.bulkDelete_!! before ViewDefinition.bulkDelete_!!
  • LoginAttempts.incrementBadLoginAttempts: wrap first-time create.save in tryo
  • migrate_00000020.sql: dedup prerequisite for new UniqueIndex additions on mapperaccountholder and mappedentitlement

Codebase-wide audit (4 additional mapping provider fixes)

Full scan of the codebase revealed the same unguarded getOrCreate pattern in three ID-mapping providers and getOrCreateBadLoginStatus. All four fixed with the same tryo + re-fetch pattern.

Test plan

  • All 20 scenarios in obp-api/src/test/scala/code/concurrency/CONCURRENCY_HAZARDS.md pass
  • ConcurrentViewPermissionRaceTest — 4 scenarios (N, O, R, migrate)
  • CI shards 1–4 green
  • Run migrate_00000020.sql against any DB with pre-existing data before deploying

hongwei1 added 18 commits June 11, 2026 15:25
Adds a tagged ScalaTest suite that simulates 16 confirmed database
concurrency hazards in OBP-API: lost-update, check-then-act,
check-then-insert, unique-constraint-unhandled, and counter-sequence
races across money movement, security, consent scheduling, view
permissions, and OAuth user creation paths.

Each scenario asserts the theoretically-correct outcome so that a
hazard surfaces as a FAILED test (red bar = evidence the hazard is
real). Two safeguard scenarios (connection pool back-pressure,
per-request context isolation) are verified to PASS.

All scenarios are tagged ConcurrencyRace and excluded from the CI
main flow. See CONCURRENCY_HAZARDS.md for the full taxonomy, source
locations, and three-tier protection analysis.

Run only these tests:
  mvn -pl obp-commons,obp-api scalatest:test \
    -DtagsToInclude=code.concurrency.ConcurrencyRace \
    -DfailIfNoTests=false
…vel locking

Replace non-atomic read-modify-write counter updates in LoginAttempt.incrementBadLoginAttempts
and MappedChallengeProvider.validateChallenge with atomic Doobie SQL operations using
SELECT FOR UPDATE + UPDATE to serialize concurrent access at the database level.

Previously, concurrent requests all read the same starting counter value and each wrote
back an identical incremented value (lost-update), allowing attackers to:
- Send concurrent bad-login attempts without consuming the lockout threshold (Scenario H)
- Submit concurrent wrong challenge answers without exhausting the allowed-attempt budget (Scenario K)

The SELECT FOR UPDATE acquires an exclusive row lock before the UPDATE, forcing subsequent
transactions to wait and then re-read the current committed counter value rather than
operating on their snapshot value. Verified against H2 (MVCC/MVStore) in concurrency tests.
…consents

Replace unconditional .save calls in ConsentScheduler with conditional Doobie
UPDATEs that guard on the expected current status. If the status was changed
by a concurrent HTTP request between the scheduler's findAll and its update,
the WHERE clause does not match and the stale write is silently skipped (0 rows
updated), preserving the terminal status committed by the HTTP path.

Affected scheduler tasks:
- expiredBerlinGroupConsents: guards on mstatus IN ('valid','VALID')
- unfinishedBerlinGroupConsents: guards on mstatus = 'received'
- expiredObpConsents: guards on mstatus = 'ACCEPTED'

Fixes ConcurrentConsentRaceTest scenarios J and U.
… methods

Add UniqueIndex constraints where missing and wrap saveMe() calls in
scala.util.Try or tryo, re-fetching the committed row on constraint
violation so concurrent callers always get a usable result rather than
an unhandled JdbcSQLIntegrityConstraintViolationException.

Affected methods:
- MappedEntitlement.addEntitlement (C): UniqueIndex(bankId,userId,roleName) + tryo retry
- MapperAccountHolders.getOrCreateAccountHolder (D): UniqueIndex(user,bankPermalink,accPermalink) + tryo retry
- MapperCounterparties.getOrCreateMetadata (F): tryo retry on existing UniqueIndex(counterpartyId)
- OAuth.getOrCreateConsumer (W): re-fetch on Failure from existing tryo block
- MappedUserCustomerLink.getOCreateUserCustomerLink (L): Try retry on existing UniqueIndex(userId,customerId)
- LiftUsers.getOrCreateUserByProviderId (I): Try retry on existing UniqueIndex(provider,providerId)
…ccess rows

N: getOrCreateCustomPublicView — wrap createDefaultCustomPublicView in
   scala.util.Try; on constraint violation re-fetch the committed row.
   Prevents uncaught JdbcSQLIntegrityConstraintViolationException when two
   concurrent callers both see Empty and both try to insert the _public view.

O: resetViewPermissions — wrap ViewPermission.create.save in scala.util.Try
   so a concurrent reset that already inserted the same (bank,account,view,
   permission) row is handled gracefully instead of throwing.

R: ViewDefinition.beforeDelete cascade — add a beforeDelete hook that deletes
   all AccountAccess rows for the view before the view row is removed. Closes
   the TOCTOU window in removeCustomView's empty-check-then-delete sequence:
   any AccountAccess granted in that window is cleaned up atomically with the
   view deletion rather than orphaned.
…ter atomic

Replace the non-atomic getOrDefault + put pair with ConcurrentHashMap.compute,
which holds the internal segment lock for the duration of the read-modify-write.
This prevents concurrent callers from reading the same stale tuple and
overwriting each other's increments.
…ons against duplicate inserts

- MapperViews.getOrCreateSystemView: wrap createDefaultSystemView in scala.util.Try;
  re-fetch the committed row on constraint violation (mirrors the N fix for custom views)
- MapperViews.migrateViewPermissions: wrap the four bare ViewPermission.create.save call
  sites (system-view and custom-view branches for both boolean and list-valued permissions)
  in scala.util.Try; duplicate inserts from concurrent migrations are silently absorbed
- ConcurrentViewPermissionRaceTest: add 'migrate' scenario — two threads concurrently
  call migrateViewPermissions on a fresh custom view; asserts no exception and exactly
  one ViewPermission row per enabled permission
- CONCURRENCY_HAZARDS.md: update to 20 scenarios; document M/P/migrateViewPermissions
  as fixed; add migrate row to the View Permissions results table
- DoobieTransactionRequestQueries: use runUpdate instead of runQuery for
  lockTransactionRequest so SELECT FOR UPDATE holds the row lock on the
  fallback (non-request-context) path; runQuery's Strategy.void transactor
  released the lock before the caller's payment logic could run

- OAuth.getOrCreateConsumer: re-fetch after UniqueIndex failure now uses
  (azp, sub) only when both are present; when either is absent falls back
  to re-fetch by key to avoid matching an unrelated consumer stored with
  empty azp+sub fields

- MapperViews.removeAllViewsAndVierPermissions: explicitly delete
  AccountAccess rows before ViewDefinition.bulkDelete_!! because
  bulkDelete_!! bypasses beforeDelete hooks, leaving orphaned rows

- LoginAttempts.incrementBadLoginAttempts: wrap the first-time
  MappedBadLoginAttempt create in tryo so concurrent first-time bad
  logins don't propagate an uncaught JDBC UniqueIndex exception

- Add migrate_00000020.sql: dedup mapperaccountholder and mappedentitlement
  rows before the new UniqueIndex additions can be applied by Schemifier;
  without prior dedup Schemifier's CREATE UNIQUE INDEX aborts on any
  production DB that accumulated duplicate rows before this fix
…n status against duplicate inserts

Three ID-mapping providers (Customer, Transaction, Account) and
getOrCreateBadLoginStatus in LoginAttempts had the same check-then-insert
race: two concurrent callers both saw Empty, both called saveMe(), and
the losing thread threw an uncaught UniqueIndex violation.

Apply the standard tryo + re-fetch pattern to all four methods:
- wrap saveMe in tryo to absorb the UniqueIndex violation
- on Failure, re-fetch by the natural key to return the committed row
- getOrCreateBadLoginStatus: replace eager .or(Full(saveMe())) with
  an explicit match so saveMe is only called when find returns non-Full
@hongwei1 hongwei1 changed the title Feature/concurrency hazard tests fix(concurrency): guard 20+ race conditions across getOrCreate, balance updates, and view permissions Jun 23, 2026
… providers

- MappedCustomerIdMappingProvider: `case other => other.map(_.customerId)`
  fixes Box[MappedCustomerIdMapping] returned where Box[CustomerId] required
- MappedTransactionIdMappingProvider: same fix → `.map(_.transactionId)`
- MappedAccountIdMappingProvider: same fix → `.map(_.accountId)`
- LoginAttempts: add `Failure` to the net.liftweb.common import so the
  inner tryo match arm compiles (was: not found: value Failure)
@hongwei1 hongwei1 closed this Jun 23, 2026
@hongwei1 hongwei1 reopened this Jun 23, 2026
hongwei1 added 8 commits June 23, 2026 17:37
Replace opaque (c, o) tuple destructuring with (totalCallCount, openFuturesCount)
in incrementFutureCounter and decrementFutureCounter. Add inline comments explaining
the atomicity guarantee from ConcurrentHashMap.compute, the role of each counter, and
the null-guard initialisation behaviour in the decrement path.
The per-permission boolean columns on ViewDefinition (canSeeTransactionAmount_,
canSeeTransactionCurrency_, ...) were superseded by the normalized ViewPermission
table. Production already reads via allowed_actions and writes via
ViewPermission.resetViewPermissions, leaving the columns as a legacy data source
read only by the one-time migrateViewPermissions bridge.

- Reimplement the boolean permission accessors to derive from allowed_actions,
  making ViewPermission the single source of truth.
- Comment out the boolean permission MappedBoolean columns; Schemifier stops
  managing them while the physical columns remain for a later SQL drop. The
  MappedText list columns (canGrantAccessToViews_ / canRevokeAccessToViews_) stay.
- Remove the migrateViewPermissions bridge, the obsolete MigrationOf* view
  permission scripts, and their boot wiring in Migration.scala.
- Drop the now-moot migrate concurrency scenario and its CONCURRENCY_HAZARDS.md
  entry, since the migrated method no longer exists.

Closes #26
Replace the manual DBA script migrate_00000020.sql with an automatic
pre-Schemifier dedup step called unconditionally in Boot.scala.

The concurrency-hazard fix added UniqueIndex to MapperAccountHolders
and MappedEntitlements. Schemifier issues CREATE UNIQUE INDEX without a
prior duplicate check, so an existing DB with duplicates would fail to
start. The SQL file was a manual workaround.

The replacement adds Migration.database.deduplicateBeforeUniqueIndexSchemify()
directly before schemifyAll() in Boot.scala. It is not gated by
migration_scripts.* props (Schemifier itself is not gated), runs as a
no-op on fresh/test DBs (table-absent check), and probes for duplicates
before issuing the DELETE so it is cheap on healthy databases. On an
existing database with duplicates it removes all but the lowest-id row
per natural key before the UniqueIndex DDL runs.

Also adds DbFunction.tableExistsByName(tableName: String) as a
string-based counterpart to the existing tableExists(BaseMetaMapper)
helper, to avoid importing the mapper objects inside Migration.
…ition

The per-permission MappedBoolean columns were retired in issue #26 and
left as a /* */ block comment to allow Schemifier to stop managing them
while keeping the physical DB columns intact. Now that the approach has
been verified in production, remove the dead comment block entirely.

The two MappedText list columns (canGrantAccessToViews_ /
canRevokeAccessToViews_) are intentionally kept — they serve a
different purpose and are still active.
The request layer migrated from Lift Web to http4s, but the comments
still described connection sharing in terms of "Lift HTTP request
context". Update the terminology to http4s request scope and rename
the private helper liftCurrentConnection -> currentRequestConnection.
References to Lift Mapper's DB API (DB.currentConnection / DB.runQuery)
are kept — Mapper remains the ORM.
Remove the obsolete ScalaTest simulation plan
(2026-06-11-10-22-obp-api-api-tidy-robin.md). It was a pre-work planning
artifact describing a superseded state (7 hazards A-G, account-holder
still a non-unique Index, fixes listed as future work). It is referenced
nowhere and is fully superseded by CONCURRENCY_HAZARDS.md, which
documents the delivered suite (19 scenarios, all fixed).

Salvage the one piece of lasting value the plan held that the reference
lacked: the H2 in-memory reproduction caveats (DB-isolation-independent
hazards, table-lock masking, the asymmetric "reproduced on H2 => Postgres
has it; not reproduced != safe" conclusion, and the dispatch HttpClient
pool retry note) — now a "Testing Notes" section in CONCURRENCY_HAZARDS.md.
…azard-tests

# Conflicts:
#	obp-api/src/main/scala/code/api/util/migration/Migration.scala
…ion framework

deduplicateBeforeUniqueIndexSchemify() is the only schema-touching call in
Boot that runs outside Migration.database.executeScripts, and the call site
had no explanation. With the migration README/scaladoc now establishing that
schema changes go through the runOnce framework, a future reader could move
this into executeScripts and silently break it.

Document at both the Boot.scala call site and the method: it MUST run before
schemifyAll() (both executeScripts passes run after Schemifier, despite the
historical "before Schemifier" comment) and ungated by migration_scripts.*
props (off in tests, while Schemifier creates the index in every environment).
No logic change.
hongwei1 added 4 commits June 24, 2026 10:21
The scripts/migrate/*.sql files were manual DBA runbooks for upgrading
existing databases by hand. Nothing executes them: there is no
Flyway/Liquibase/RUNSCRIPT loader, no code or build reference, and
src/main/scripts is not packaged into the artifact. Schema evolution is
handled by Lift Schemifier plus the runOnce MigrationOf* framework in
Migration.scala, which supersedes these one-off scripts.

Removing them keeps the migration story in one place (Migration.scala +
the migration README) and avoids implying these need to be run or
extended. History is preserved in git if a legacy manual upgrade path is
ever needed.
… boot sequence

Both executeScripts passes run AFTER schemifyAll(); the comment and log
claiming the existing-DB pass runs "before Lift Mapper Schemifier" have
been stale since 2021 (commit ea45370 moved schemifyAll ahead of both
passes). The startedBeforeSchemifier flag selects the existing-DB pass,
in which post-Schemifier-dependent migrations skip themselves — it does
not change call order relative to Schemifier.

Reword the Boot comment and the existing-DB log line, and document the
real semantics of the flag on Migration.database.executeScripts. The
fresh-DB log line is already accurate and is left unchanged. No behaviour
change.
…e docs

The resetPasswordUrl description had a raw <email> placeholder and two
dynamic-entity-backup descriptions had a raw <BackupEntityName>
placeholder. When the resource docs are serialized and parsed as XML
(ResourceDocsTest "getResourceDocsObp Api -v6.0.0"), these read as
unterminated XML start tags and throw SAXParseException, failing the
ResourceDocs CI shard.

Replace <email> with the concrete example user@example.com (already the
idiomatic placeholder elsewhere in this file) and <BackupEntityName> with
{BackupEntityName}. Documentation text only.
… DB vendors

The natural-key dedup ran DELETE ... WHERE id NOT IN (SELECT MIN(id) FROM
sameTable GROUP BY ...). That survivor-set form names the target table inside the
DELETE subquery, which MySQL/MariaDB reject with ERROR 1093, and it scales with
table size rather than with the number of duplicates. MySQL is a first-class OBP
target (driver shipped, per-vendor SQL branches throughout this package), so the
form could abort boot on a MySQL deployment while passing on H2/Postgres.

Switch to a delete-set form: a derived table over
ROW_NUMBER() OVER (PARTITION BY <natural key> ORDER BY id ASC) that targets only
the surplus rows. The derived table is materialised, which sidesteps 1093, and
window functions are supported by every driver OBP ships (PostgreSQL, H2 2.x,
MySQL 8+/MariaDB 10.2+, SQL Server, Oracle). The survivor is unchanged: the lowest
id per natural-key group.

Also replace execute() with executeUpdate() so the destructive cleanup logs how
many rows it removed, and document the lossy survivor policy: discarded duplicates'
mentitlementid UUIDs and provenance columns are dropped, while the surviving row
still encodes the same grant/ownership.

Verified the new SQL on H2 2.2.220 (the shipped version): it keeps the lowest id
per group, including for NULL partition keys.
@simonredfern simonredfern merged commit aad6300 into OpenBankProject:develop Jun 24, 2026
11 checks passed
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants