Add an optional Numba solver for feature-selected FastL2LiR fitting#15
Add an optional Numba solver for feature-selected FastL2LiR fitting#15izpyon wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces an opt-in solver="numba" implementation for FastL2LiR.fit() to accelerate the feature-selection ridge fitting path, while keeping the existing NumPy behavior as the default. It also adds tests for solver agreement and a set of speed-test scripts to benchmark runtime and record environment details.
Changes:
- Added
solver/numba_num_threadsparameters and a Numba-based selected-feature ridge fitting path (used only when feature selection is active andsave_select_feat=False). - Added optional Numba implementation module (
src/fastl2lir/_numba.py) with BLAS-backend validation and warm-up compilation. - Added tests for Numba vs NumPy solver agreement plus a
speedtests/benchmark harness and anumbaoptional dependency group.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/fastl2lir/fastl2lir.py |
Adds solver / numba_num_threads, routes selected-feature fitting to NumPy or Numba, and makes feature selection sorting stable for this path. |
src/fastl2lir/_numba.py |
Implements the Numba parallel kernel + environment validation (MKL rejection) + warm-up. |
tests/test__numba.py |
Adds solver equivalence tests, environment validation tests, and thread-count restoration checks. |
tests/test_fastl2lir.py |
Minor formatting/quoting update in an existing test. |
pyproject.toml |
Adds project.optional-dependencies.numba for opt-in installation. |
speedtests/* |
Adds scripts and utilities to benchmark/compare solvers and environments, plus documentation. |
.gitignore / speedtests/.gitignore |
Ignores numba cache files and local speedtest outputs/config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import fastl2lir | ||
| import fastl2lir._numba as fastl2lir_numba | ||
|
|
| def test_numba_solver_allows_non_mkl_backend(self): | ||
| """Test that numba environment validation allows non-MKL BLAS.""" | ||
|
|
||
| with ( |
| warnings.warn( | ||
| "solver='numba' is only used for feature-selection fitting " | ||
| "when save_select_feat is False. Falling back to the numpy " | ||
| "fit path for this call." | ||
| ) |
| elif solver == "numba": | ||
| if fit_selected_ridge_numba is None: | ||
| raise ImportError("solver='numba' requires numba to be installed") | ||
| W, b = fit_selected_ridge_numba( | ||
| X, C, W0, W1, n_feat, dtype, numba_num_threads | ||
| ) |
| X, C, W0, W1, n_feat, dtype, numba_num_threads | ||
| ) | ||
| else: | ||
| raise ValueError("Unknown solver specified:", solver) |
| solver="numpy", | ||
| numba_num_threads=4, |
fix my mistakes... Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Supplemental benchmark resultsI'm sorry but I forgot to include the benchmark numbers in the initial PR description, so I am adding them here as supplementary local validation. These results were measured on
Across these cases, the Numba solver was approximately These numbers should be interpreted as environment-specific benchmark results rather than a general performance guarantee. The actual speedup may depend on the BLAS backend, thread configuration, problem size, and hardware. |
ganow
left a comment
There was a problem hiding this comment.
Thanks for working on this, and for keeping the numba path opt-in. Before going into the algorithm details (which I will review separately), I want to step back to the structure of the change and how it should be sequenced, because it overlaps with another in-flight PR.
Context: there is another open PR, #10, that also adds a solver option, but in a different place and form. #10 adds solver to the constructor (FastL2LiR(solver=...)) with values scipy/numpy, and routes every linear solve through it. This PR adds solver to fit() with values numpy/numba, applied to the feature-selection path only. If both land as they are, the library ends up with two different and incompatible solver parameters in two different places. So before merging this, I want to settle one consistent design.
Direction I would like to take:
- A single
solver, selected once and injected through the constructor, as #10 does. Not a per-callfit()argument, and not two separatesolverknobs. - Solver-specific options should live inside the solver, not on the shared API. Right now
numba_num_threadsis a parameter onfit(), but it only means anything for the numba solver. If we keep adding solvers this way,fit()(and the class) accumulate per-solver knobs (scipy_*,numba_*, and so on). Instead, each solver should own its own configuration so that its dependencies stay contained within it. - The clean way to get both of those is to make a solver a strategy object: an object that bundles a solver's implementation together with its own options, constructed by the user and passed in once, for example
FastL2LiR(solver=NumbaSolver(num_threads=8)), with plain strings like"numba"kept as aliases for default configurations.
Because this depends on #10 and on an abstraction that does not exist yet, I would like to split the work into separate PRs rather than land it all here:
- Merge #10 first. That establishes the constructor-level
solveroption. - A PR that introduces the solver strategy abstraction (the interface, plus the existing numpy/scipy solvers expressed through it).
- This numba work rebased on top of that abstraction, as a numba strategy.
This is not a rejection of the numba implementation. It is about giving it a stable foundation to sit on, so that we do not ship a second conflicting solver API and then have to migrate it later.
Separately, a dedicated PR for a benchmarking harness would be welcome. The speedtests/ directory here is a good basis for one, but it is a different concern from the solver change and is large enough to review on its own, so I would prefer it not ride along with this PR.
The inline comments are the more specific, code-level points to carry into that work.
| return W.T, b | ||
|
|
||
|
|
||
| def warmup_numba(dtype=np.float64): |
There was a problem hiding this comment.
Recommend removing warmup_numba (and its call inside fit_selected_ridge_numba).
Claim: warmup does not reduce a real fit's wall-clock, because numba's JIT compile cost is paid inside the same fit() call whether or not warmup runs first. The compile cost is independent of input size, so compiling on tiny data first buys nothing the real call does not already pay.
Evidence is reproducible. All measurements below are on this PR branch with numba 0.65.1 (Apple Silicon, numpy on the Accelerate backend); absolute times are machine-specific but the relations hold across backends.
- Compile cost is independent of data size, while only execution scales. Save as
mwe_compile.pyand run each size in a fresh process:
import sys, time
import numpy as np
from numba import njit
@njit # cache=False (default): each process compiles fresh, isolating compile cost
def sum_sq(a):
s = 0.0
for i in range(a.size):
s += a[i] * a[i]
return s
size = int(sys.argv[1])
a = np.ones(size, dtype=np.float64)
t0 = time.perf_counter(); sum_sq(a); t1 = time.perf_counter() # compile + run
t2 = time.perf_counter(); sum_sq(a); t3 = time.perf_counter() # run only
print(f"size={size:>12,} first(compile+run)={ (t1-t0)*1000:8.1f}ms "
f"run-only={(t3-t2)*1000:9.4f}ms compile~={(t1-t0-(t3-t2))*1000:7.1f}ms")$ for n in 8 1000000 100000000; do python mwe_compile.py $n; done
| array size | first (compile+run) | run only | compile approx |
|---|---|---|---|
| 8 | 233.0 ms | 0.0012 ms | 233.0 ms |
| 1,000,000 | 138.9 ms | 0.48 ms | 138.4 ms |
| 100,000,000 | 255.4 ms | 50.7 ms | 204.7 ms |
Compile stays in the same ~140-250 ms band with no upward trend in size (the 8-element case is even highest, i.e. the variation is compile-time jitter, not data size). Only run-only scales with size.
- The current warmup is also counterproductive: it builds
Cas C-contiguous, but the real path passesC.T(F-contiguous), so numba compiles a second specialization the real call never reuses. To see the two specializations:
import numpy as np
import fastl2lir
import fastl2lir._numba as N
X = np.random.default_rng(0).normal(size=(200, 300))
Y = np.random.default_rng(1).normal(size=(200, 50))
fastl2lir.FastL2LiR().fit(X, Y, alpha=1.0, n_feat=20, solver="numba")
for sig in N._fit_selected_ridge_numba.signatures:
print(sig)
# Prints 2 signatures. The 2nd argument (C) is 'C'-contiguous in one
# (from warmup) and 'F'-contiguous in the other (the real C.T). The
# warmup specialization is never reused.- Single cold fit, warmup vs no warmup. Reproduce by running a single
fit(..., solver="numba")onX=(500,800),Y=(500,100),n_feat=50in a fresh process with a fresh numba cache (NUMBA_CACHE_DIR=$(mktemp -d)), and compare against the same run with warmup disabled (fastl2lir._numba.warmup_numba = lambda dtype=np.float64: Nonebefore the fit):
- current warmup: 4.51 s (two specializations compiled)
- warmup disabled: 3.50 s (one specialization)
- warmup fixed to match the real F-contiguous layout: 3.64 s (one specialization)
So even a bug-free warmup is at best neutral and adds its own small overhead; the current one adds ~1 s of wasted compilation on a cold cache.
A JIT first-call spike is inherent to JIT compilation and is acceptable; it does not warrant a warmup step in the library.
| cache_dir="./cache", | ||
| dtype=np.float64, | ||
| solver="numpy", | ||
| numba_num_threads=4, |
There was a problem hiding this comment.
Two concerns with numba_num_threads as a public fit() parameter:
-
It is solver-specific state on the general
fit()signature. As more solvers are added, the signature tends to accumulate per-solver keywords (numba_*, and so on). One way to avoid this is to let the chosen solver carry its own options (for example, passing the solver as an object constructed with its own settings) rather than adding solver-specific keywords tofit(). -
The default
4caps parallelism.set_num_threads(min(numba_num_threads, previous))only ever reduces the thread count, so on a many-core machine the default silently runs at <= 4 threads. For a speedup feature that is surprising. Consider defaulting toNone(let numba /NUMBA_NUM_THREADSdecide) and treating the prange-vs-BLAS oversubscription concern as a solver-internal default rather than a hardcoded cap.
| chunk_size=0, | ||
| cache_dir="./cache", | ||
| dtype=np.float64, | ||
| solver="numpy", |
There was a problem hiding this comment.
This PR and #10 both introduce a solver option but in incompatible ways: #10 puts it on __init__ with values scipy/numpy (default scipy) and routes every solve, while this PR puts it on fit() with values numpy/numba (default numpy) for the feature-selection path only. They need to be reconciled.
Suggested design: make solver a strategy object selected once (constructor side, consistent with #10), with string aliases for defaults:
FastL2LiR(solver="numba") # alias for the default config
FastL2LiR(solver=NumbaSolver(num_threads=8)) # explicit tuning
To keep the dispatch point consistent across solvers, define the strategy interface at the operation level (e.g. a fit_selected(...) that runs the whole feature-selection loop). numpy and scipy share one loop parameterized by an injected single-system solve (this is where #10's per-system solver lives), and numba implements the same operation with its kernel. numba delegates unsupported paths (save_select_feat, spatial_norm, no selection) to the numpy implementation, which is exactly the fallback this PR already warns about.
Suggested sequencing: merge #10 first, then a PR introducing the strategy interface, then this numba work rebased as a numba strategy on top.
| check_numba_solver_environment(dtype) | ||
|
|
||
|
|
||
| def fit_selected_ridge_numba(X, C, W0, W1, n_feat, dtype, numba_num_threads=4): |
There was a problem hiding this comment.
fit_selected_ridge_numba is effectively the entry point used by fastl2lir.py, but it lives in a private module (_numba), and the tests import fastl2lir._numba directly. Consider making the public boundary explicit:
- If numba is exposed to users through a public handle (for example a
NumbaSolverclass in a publicsolversmodule), keep the njit kernel, warmup, and MKL detection private in_numba.pyand have that public class wrap them. Tests would then exercise the public API, with only kernel-specific unit tests reaching into_numba. - If numba stays a plain string option (
solver="numba") with no public class, rename this function to_fit_selected_ridge_numbato mark it internal.
| with threadpool_limits(limits=1, user_api="blas"): | ||
| for index_outputDim in tqdm(range(Y.shape[1])): | ||
| C0 = abs(C[index_outputDim, :]) | ||
| feat_idx = np.argsort(C0, kind="mergesort") |
There was a problem hiding this comment.
Why is the numpy path switched from np.argsort(C0) (quicksort) to kind="mergesort"?
This changes the existing numpy path, not just the numba path. The previous quicksort is unstable, so when two features tie in absolute correlation at the n_feat cutoff, switching to a stable sort can change which feature is selected, and therefore the resulting W and predictions, relative to previous releases. Non-tied data is unaffected. This also affects current users who never use the numba solver.
Could you explain what motivates the change (for example, deterministic/reproducible tie-breaking, or making the numpy and numba solvers agree in the cross-solver test)? Flagging it because it alters existing default behavior.
| @@ -0,0 +1,86 @@ | |||
| # Speed Tests | |||
There was a problem hiding this comment.
Could we split speedtests/ out into its own PR rather than landing it here? It is ~900 lines across several files and inflates this PR's review scope well beyond the ~157-line core numba change, and it is a separate concern (benchmark tooling) from the solver feature itself. Keeping them apart makes both easier to review and gets the solver work merged sooner.
To be clear, this is worth keeping, not dropping. A reproducible benchmark harness that records the BLAS backend, thread settings, and prediction-difference checks is genuinely valuable for a library whose main claim is speed. I would like it back as a dedicated PR. When it returns, a few portability issues are worth fixing so that others can actually run it:
-
speedrun_legacy.shrunsconda activate baseand then executes${LEGACY_CONDA}/bin/python. This assumes the user has conda installed and that the legacyfastl2lirlives specifically in conda'sbaseenvironment. Anyone who keeps the legacy package in a named env, or who does not use conda at all, cannot run this. The environment and interpreter should be configurable rather than hardcoded tobase. -
The legacy and MKL runners default to the local paths
.venv-legacyand.venv-mkl(inlocal.sh.example,speedrun_legacy.sh, andspeedrun_mkl.sh). Those directories will not exist on another machine, so those two runners only work for whoever happens to have environments at exactly those paths. That is fine as a personal convenience, but it should be documented as required local setup rather than presented as a runner that works out of the box. -
speedrun_mkl.shdefaultsMKL_BENCH_RUNNERS="numpy,numba", so it tries to run the numba solver. But the numba solver raisesRuntimeErrorwhen it detects an MKL BLAS backend, which is exactly the environment this script targets. As written, the MKL speed run would crash on the numba runner. Either the MKL run should exclude numba, or the MKL rejection policy needs revisiting.
A clear home such as a benchmarks/ directory would also help.
For this PR, it would be enough to run the benchmark on these changes and share a simple per-environment speed comparison (a small table of numpy vs numba median times / speedups across the environments you tested) in the PR description or a comment.
Summary
This PR adds an optional
solver="numba"path toFastL2LiR.fit()to accelerate ridge-regression fitting when feature selection is enabled.The default behavior remains
solver="numpy". Existing users keep the current NumPy implementation unless they explicitly opt in to the experimental Numba solver.Main Changes
solverandnumba_num_threadsarguments toFastL2LiR.fit()solver="numpy": the default implementationsolver="numba": an optional implementation for the feature-selection fitting path whensave_select_feat=Falsesrc/fastl2lir/_numba.pynumba.njit(parallel=True)numba_num_threadsis specifiednumbaoptional dependency group inpyproject.tomlnumba>=0.59scipy>=1.8.0tests/test__numba.pyfloat32fitting with explicit tolerancesolver="numba"falls back to a NumPy fitting pathspeedtests/numpyandnumbasolversImplementation Notes
FastL2LiR.fit()keeps the existing setup flow: dtype conversion, optional reshaping ofY, feature-selection settings, chunking, andsave_select_feathandling.The Numba solver is only used for the feature-selection path when
save_select_feat=False. The following paths continue to use the existing NumPy implementation:save_select_feat=Truespatial_normis specifiedselect_sampleis specifiedFor the feature-selection fitting path, both solvers use the same mathematical objects:
C: correlations between input features and target unitsW0 = X.T @ X + alpha * I: Gram matrix including the bias columnW1 = Y.T @ X: right-hand side for each target unitThe NumPy solver loops over target units, selects the top
n_featfeatures by absolute correlation, appends the bias index, and solves a small ridge system withnp.linalg.solve.The Numba solver passes the same
C,W0, andW1to_fit_selected_ridge_numba(). The Numba kernel parallelizes over target units withprangeand, for each unit:n_featfeatures by absolute correlationW0W1np.linalg.solveBoth NumPy and Numba paths now use stable sorting for feature selection. This makes tied-correlation cutoff behavior deterministic across solvers. In non-tied data this should not change results, but in tied cases the selected features may differ from the previous NumPy quicksort behavior.
For
dtype=np.float32, the Numba solver performs the per-unit linear solve internally in float64 and casts the result back to float32. Therefore float32 results are expected to agree with NumPy within float32 tolerance, not bit-for-bit.Design Principles
solver="numpy"remains the default.speedtests/records environment details, runner labels, dimensions, median runtimes, speedups, and prediction differences.Known Limitations and Review Notes
solver="numba"is experimental and limited to the feature-selection fitting path withsave_select_feat=False.solver="numba"is requested for an unsupported fit path,fit()now emits a warning and uses the NumPy path.prangewhile each unit callsnp.linalg.solve.numba_num_threadscontrols Numba-side parallelism, but it does not directly control BLAS threads inside the compiled kernel.speedtests/may be useful but could be considered extra for this PR.Validation
Validated locally with the project
.venv:Notes:
pytestused Python 3.8 and failed to importfastl2lir._numba.requires-python = ">=3.10", so the.venvresult above is the relevant validation result.