fix(concurrency): guard 20+ race conditions across getOrCreate, balance updates, and view permissions#2845
Merged
simonredfern merged 31 commits intoJun 24, 2026
Conversation
Adds a tagged ScalaTest suite that simulates 16 confirmed database
concurrency hazards in OBP-API: lost-update, check-then-act,
check-then-insert, unique-constraint-unhandled, and counter-sequence
races across money movement, security, consent scheduling, view
permissions, and OAuth user creation paths.
Each scenario asserts the theoretically-correct outcome so that a
hazard surfaces as a FAILED test (red bar = evidence the hazard is
real). Two safeguard scenarios (connection pool back-pressure,
per-request context isolation) are verified to PASS.
All scenarios are tagged ConcurrencyRace and excluded from the CI
main flow. See CONCURRENCY_HAZARDS.md for the full taxonomy, source
locations, and three-tier protection analysis.
Run only these tests:
mvn -pl obp-commons,obp-api scalatest:test \
-DtagsToInclude=code.concurrency.ConcurrencyRace \
-DfailIfNoTests=false
…into feature/concurrency-hazard-tests
…json4s in concurrency tests
…locking and fallback transactor
…lect global Http4s proxy usage
…vel locking Replace non-atomic read-modify-write counter updates in LoginAttempt.incrementBadLoginAttempts and MappedChallengeProvider.validateChallenge with atomic Doobie SQL operations using SELECT FOR UPDATE + UPDATE to serialize concurrent access at the database level. Previously, concurrent requests all read the same starting counter value and each wrote back an identical incremented value (lost-update), allowing attackers to: - Send concurrent bad-login attempts without consuming the lockout threshold (Scenario H) - Submit concurrent wrong challenge answers without exhausting the allowed-attempt budget (Scenario K) The SELECT FOR UPDATE acquires an exclusive row lock before the UPDATE, forcing subsequent transactions to wait and then re-read the current committed counter value rather than operating on their snapshot value. Verified against H2 (MVCC/MVStore) in concurrency tests.
…consents
Replace unconditional .save calls in ConsentScheduler with conditional Doobie
UPDATEs that guard on the expected current status. If the status was changed
by a concurrent HTTP request between the scheduler's findAll and its update,
the WHERE clause does not match and the stale write is silently skipped (0 rows
updated), preserving the terminal status committed by the HTTP path.
Affected scheduler tasks:
- expiredBerlinGroupConsents: guards on mstatus IN ('valid','VALID')
- unfinishedBerlinGroupConsents: guards on mstatus = 'received'
- expiredObpConsents: guards on mstatus = 'ACCEPTED'
Fixes ConcurrentConsentRaceTest scenarios J and U.
… methods Add UniqueIndex constraints where missing and wrap saveMe() calls in scala.util.Try or tryo, re-fetching the committed row on constraint violation so concurrent callers always get a usable result rather than an unhandled JdbcSQLIntegrityConstraintViolationException. Affected methods: - MappedEntitlement.addEntitlement (C): UniqueIndex(bankId,userId,roleName) + tryo retry - MapperAccountHolders.getOrCreateAccountHolder (D): UniqueIndex(user,bankPermalink,accPermalink) + tryo retry - MapperCounterparties.getOrCreateMetadata (F): tryo retry on existing UniqueIndex(counterpartyId) - OAuth.getOrCreateConsumer (W): re-fetch on Failure from existing tryo block - MappedUserCustomerLink.getOCreateUserCustomerLink (L): Try retry on existing UniqueIndex(userId,customerId) - LiftUsers.getOrCreateUserByProviderId (I): Try retry on existing UniqueIndex(provider,providerId)
…ccess rows N: getOrCreateCustomPublicView — wrap createDefaultCustomPublicView in scala.util.Try; on constraint violation re-fetch the committed row. Prevents uncaught JdbcSQLIntegrityConstraintViolationException when two concurrent callers both see Empty and both try to insert the _public view. O: resetViewPermissions — wrap ViewPermission.create.save in scala.util.Try so a concurrent reset that already inserted the same (bank,account,view, permission) row is handled gracefully instead of throwing. R: ViewDefinition.beforeDelete cascade — add a beforeDelete hook that deletes all AccountAccess rows for the view before the view row is removed. Closes the TOCTOU window in removeCustomView's empty-check-then-delete sequence: any AccountAccess granted in that window is cleaned up atomically with the view deletion rather than orphaned.
…ter atomic Replace the non-atomic getOrDefault + put pair with ConcurrentHashMap.compute, which holds the internal segment lock for the duration of the read-modify-write. This prevents concurrent callers from reading the same stale tuple and overwriting each other's increments.
…ons against duplicate inserts - MapperViews.getOrCreateSystemView: wrap createDefaultSystemView in scala.util.Try; re-fetch the committed row on constraint violation (mirrors the N fix for custom views) - MapperViews.migrateViewPermissions: wrap the four bare ViewPermission.create.save call sites (system-view and custom-view branches for both boolean and list-valued permissions) in scala.util.Try; duplicate inserts from concurrent migrations are silently absorbed - ConcurrentViewPermissionRaceTest: add 'migrate' scenario — two threads concurrently call migrateViewPermissions on a fresh custom view; asserts no exception and exactly one ViewPermission row per enabled permission - CONCURRENCY_HAZARDS.md: update to 20 scenarios; document M/P/migrateViewPermissions as fixed; add migrate row to the View Permissions results table
- DoobieTransactionRequestQueries: use runUpdate instead of runQuery for lockTransactionRequest so SELECT FOR UPDATE holds the row lock on the fallback (non-request-context) path; runQuery's Strategy.void transactor released the lock before the caller's payment logic could run - OAuth.getOrCreateConsumer: re-fetch after UniqueIndex failure now uses (azp, sub) only when both are present; when either is absent falls back to re-fetch by key to avoid matching an unrelated consumer stored with empty azp+sub fields - MapperViews.removeAllViewsAndVierPermissions: explicitly delete AccountAccess rows before ViewDefinition.bulkDelete_!! because bulkDelete_!! bypasses beforeDelete hooks, leaving orphaned rows - LoginAttempts.incrementBadLoginAttempts: wrap the first-time MappedBadLoginAttempt create in tryo so concurrent first-time bad logins don't propagate an uncaught JDBC UniqueIndex exception - Add migrate_00000020.sql: dedup mapperaccountholder and mappedentitlement rows before the new UniqueIndex additions can be applied by Schemifier; without prior dedup Schemifier's CREATE UNIQUE INDEX aborts on any production DB that accumulated duplicate rows before this fix
…n status against duplicate inserts Three ID-mapping providers (Customer, Transaction, Account) and getOrCreateBadLoginStatus in LoginAttempts had the same check-then-insert race: two concurrent callers both saw Empty, both called saveMe(), and the losing thread threw an uncaught UniqueIndex violation. Apply the standard tryo + re-fetch pattern to all four methods: - wrap saveMe in tryo to absorb the UniqueIndex violation - on Failure, re-fetch by the natural key to return the committed row - getOrCreateBadLoginStatus: replace eager .or(Full(saveMe())) with an explicit match so saveMe is only called when find returns non-Full
… providers - MappedCustomerIdMappingProvider: `case other => other.map(_.customerId)` fixes Box[MappedCustomerIdMapping] returned where Box[CustomerId] required - MappedTransactionIdMappingProvider: same fix → `.map(_.transactionId)` - MappedAccountIdMappingProvider: same fix → `.map(_.accountId)` - LoginAttempts: add `Failure` to the net.liftweb.common import so the inner tryo match arm compiles (was: not found: value Failure)
Replace opaque (c, o) tuple destructuring with (totalCallCount, openFuturesCount) in incrementFutureCounter and decrementFutureCounter. Add inline comments explaining the atomicity guarantee from ConcurrentHashMap.compute, the role of each counter, and the null-guard initialisation behaviour in the decrement path.
The per-permission boolean columns on ViewDefinition (canSeeTransactionAmount_, canSeeTransactionCurrency_, ...) were superseded by the normalized ViewPermission table. Production already reads via allowed_actions and writes via ViewPermission.resetViewPermissions, leaving the columns as a legacy data source read only by the one-time migrateViewPermissions bridge. - Reimplement the boolean permission accessors to derive from allowed_actions, making ViewPermission the single source of truth. - Comment out the boolean permission MappedBoolean columns; Schemifier stops managing them while the physical columns remain for a later SQL drop. The MappedText list columns (canGrantAccessToViews_ / canRevokeAccessToViews_) stay. - Remove the migrateViewPermissions bridge, the obsolete MigrationOf* view permission scripts, and their boot wiring in Migration.scala. - Drop the now-moot migrate concurrency scenario and its CONCURRENCY_HAZARDS.md entry, since the migrated method no longer exists. Closes #26
Replace the manual DBA script migrate_00000020.sql with an automatic pre-Schemifier dedup step called unconditionally in Boot.scala. The concurrency-hazard fix added UniqueIndex to MapperAccountHolders and MappedEntitlements. Schemifier issues CREATE UNIQUE INDEX without a prior duplicate check, so an existing DB with duplicates would fail to start. The SQL file was a manual workaround. The replacement adds Migration.database.deduplicateBeforeUniqueIndexSchemify() directly before schemifyAll() in Boot.scala. It is not gated by migration_scripts.* props (Schemifier itself is not gated), runs as a no-op on fresh/test DBs (table-absent check), and probes for duplicates before issuing the DELETE so it is cheap on healthy databases. On an existing database with duplicates it removes all but the lowest-id row per natural key before the UniqueIndex DDL runs. Also adds DbFunction.tableExistsByName(tableName: String) as a string-based counterpart to the existing tableExists(BaseMetaMapper) helper, to avoid importing the mapper objects inside Migration.
…ition The per-permission MappedBoolean columns were retired in issue #26 and left as a /* */ block comment to allow Schemifier to stop managing them while keeping the physical DB columns intact. Now that the approach has been verified in production, remove the dead comment block entirely. The two MappedText list columns (canGrantAccessToViews_ / canRevokeAccessToViews_) are intentionally kept — they serve a different purpose and are still active.
The request layer migrated from Lift Web to http4s, but the comments still described connection sharing in terms of "Lift HTTP request context". Update the terminology to http4s request scope and rename the private helper liftCurrentConnection -> currentRequestConnection. References to Lift Mapper's DB API (DB.currentConnection / DB.runQuery) are kept — Mapper remains the ORM.
Remove the obsolete ScalaTest simulation plan (2026-06-11-10-22-obp-api-api-tidy-robin.md). It was a pre-work planning artifact describing a superseded state (7 hazards A-G, account-holder still a non-unique Index, fixes listed as future work). It is referenced nowhere and is fully superseded by CONCURRENCY_HAZARDS.md, which documents the delivered suite (19 scenarios, all fixed). Salvage the one piece of lasting value the plan held that the reference lacked: the H2 in-memory reproduction caveats (DB-isolation-independent hazards, table-lock masking, the asymmetric "reproduced on H2 => Postgres has it; not reproduced != safe" conclusion, and the dispatch HttpClient pool retry note) — now a "Testing Notes" section in CONCURRENCY_HAZARDS.md.
…azard-tests # Conflicts: # obp-api/src/main/scala/code/api/util/migration/Migration.scala
…ion framework deduplicateBeforeUniqueIndexSchemify() is the only schema-touching call in Boot that runs outside Migration.database.executeScripts, and the call site had no explanation. With the migration README/scaladoc now establishing that schema changes go through the runOnce framework, a future reader could move this into executeScripts and silently break it. Document at both the Boot.scala call site and the method: it MUST run before schemifyAll() (both executeScripts passes run after Schemifier, despite the historical "before Schemifier" comment) and ungated by migration_scripts.* props (off in tests, while Schemifier creates the index in every environment). No logic change.
The scripts/migrate/*.sql files were manual DBA runbooks for upgrading existing databases by hand. Nothing executes them: there is no Flyway/Liquibase/RUNSCRIPT loader, no code or build reference, and src/main/scripts is not packaged into the artifact. Schema evolution is handled by Lift Schemifier plus the runOnce MigrationOf* framework in Migration.scala, which supersedes these one-off scripts. Removing them keeps the migration story in one place (Migration.scala + the migration README) and avoids implying these need to be run or extended. History is preserved in git if a legacy manual upgrade path is ever needed.
… boot sequence Both executeScripts passes run AFTER schemifyAll(); the comment and log claiming the existing-DB pass runs "before Lift Mapper Schemifier" have been stale since 2021 (commit ea45370 moved schemifyAll ahead of both passes). The startedBeforeSchemifier flag selects the existing-DB pass, in which post-Schemifier-dependent migrations skip themselves — it does not change call order relative to Schemifier. Reword the Boot comment and the existing-DB log line, and document the real semantics of the flag on Migration.database.executeScripts. The fresh-DB log line is already accurate and is left unchanged. No behaviour change.
…e docs
The resetPasswordUrl description had a raw <email> placeholder and two
dynamic-entity-backup descriptions had a raw <BackupEntityName>
placeholder. When the resource docs are serialized and parsed as XML
(ResourceDocsTest "getResourceDocsObp Api -v6.0.0"), these read as
unterminated XML start tags and throw SAXParseException, failing the
ResourceDocs CI shard.
Replace <email> with the concrete example user@example.com (already the
idiomatic placeholder elsewhere in this file) and <BackupEntityName> with
{BackupEntityName}. Documentation text only.
… DB vendors The natural-key dedup ran DELETE ... WHERE id NOT IN (SELECT MIN(id) FROM sameTable GROUP BY ...). That survivor-set form names the target table inside the DELETE subquery, which MySQL/MariaDB reject with ERROR 1093, and it scales with table size rather than with the number of duplicates. MySQL is a first-class OBP target (driver shipped, per-vendor SQL branches throughout this package), so the form could abort boot on a MySQL deployment while passing on H2/Postgres. Switch to a delete-set form: a derived table over ROW_NUMBER() OVER (PARTITION BY <natural key> ORDER BY id ASC) that targets only the surplus rows. The derived table is materialised, which sidesteps 1093, and window functions are supported by every driver OBP ships (PostgreSQL, H2 2.x, MySQL 8+/MariaDB 10.2+, SQL Server, Oracle). The survivor is unchanged: the lowest id per natural-key group. Also replace execute() with executeUpdate() so the destructive cleanup logs how many rows it removed, and document the lossy survivor policy: discarded duplicates' mentitlementid UUIDs and provenance columns are dropped, while the surviving row still encodes the same grant/ownership. Verified the new SQL on H2 2.2.220 (the shipped version): it keeps the lowest id per group, including for NULL partition keys.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
This branch fixes 20+ concurrency race conditions identified through systematic hazard analysis and concurrency test suite execution. All 20 scenarios in
CONCURRENCY_HAZARDS.mdare now in the Gracefully handled tier.Core fixes
SELECT FOR UPDATErow-level lock (DoobieBankAccountQueries)lockTransactionRequestnow usesrunUpdate(Strategy.default) instead ofrunQuery(Strategy.void) so theFOR UPDATElock is held through the transaction on the fallback pathsaveMe()intryo+ re-fetch onFailureacross 9 providers: OAuth Consumer, SystemView, ViewPermission, AccountHolder, Entitlement, mapping providers (Customer / Transaction / Account), and BadLoginStatusUPDATE WHERE mStatus = guardStatus; removed directmStatus(...).saveMe()that could resurrect revoked consentsaddPermission,addPermissions,revokePermission, andmigrateViewPermissionsagainst duplicate inserts; fixremoveAllViewsAndVierPermissionsto explicitly deleteAccountAccessbeforebulkDelete_!!(which bypassesbeforeDeletehooks)incrementFutureCounter/decrementFutureCounteruseUPDATE ... SET n = n ± 1instead of read-increment-writePost-review correctness fixes (5 additional)
lockTransactionRequest:runQuery→runUpdate(critical: lock was released before payment logic ran on non-request-scoped paths)getOrCreateConsumerre-fetch: use(azp, sub)match; fall back tokeywhen either is absent to avoid matching unrelated consumers via empty-stringremoveAllViewsAndVierPermissions: explicitAccountAccess.bulkDelete_!!beforeViewDefinition.bulkDelete_!!LoginAttempts.incrementBadLoginAttempts: wrap first-timecreate.saveintryomigrate_00000020.sql: dedup prerequisite for newUniqueIndexadditions onmapperaccountholderandmappedentitlementCodebase-wide audit (4 additional mapping provider fixes)
Full scan of the codebase revealed the same unguarded
getOrCreatepattern in three ID-mapping providers andgetOrCreateBadLoginStatus. All four fixed with the sametryo+ re-fetch pattern.Test plan
obp-api/src/test/scala/code/concurrency/CONCURRENCY_HAZARDS.mdpassConcurrentViewPermissionRaceTest— 4 scenarios (N, O, R, migrate)migrate_00000020.sqlagainst any DB with pre-existing data before deploying