Add an optional Numba solver for feature-selected FastL2LiR fitting by izpyon · Pull Request #15 · KamitaniLab/PyFastL2LiR

izpyon · 2026-06-09T08:45:57Z

Summary

This PR adds an optional solver="numba" path to FastL2LiR.fit() to accelerate ridge-regression fitting when feature selection is enabled.

The default behavior remains solver="numpy". Existing users keep the current NumPy implementation unless they explicitly opt in to the experimental Numba solver.

Main Changes

Added solver and numba_num_threads arguments to FastL2LiR.fit()
- solver="numpy": the default implementation
- solver="numba": an optional implementation for the feature-selection fitting path when save_select_feat=False
Added src/fastl2lir/_numba.py
- Runs independent selected-feature ridge solves across output units with numba.njit(parallel=True)
- Warms up the Numba kernel per dtype to reduce first-fit compile/cache-load effects on real benchmark data
- Restores the previous Numba thread count after fitting when numba_num_threads is specified
- Fails fast when Intel MKL is detected as the BLAS backend
Added the numba optional dependency group in pyproject.toml
- numba>=0.59
- scipy>=1.8.0
Added tests/test__numba.py
- Checks that NumPy and Numba solvers agree on weights, bias, and predictions
- Covers float32 fitting with explicit tolerance
- Covers a tied-correlation cutoff case so feature selection is deterministic across both solvers
- Checks invalid solver names and warnings when solver="numba" falls back to a NumPy fitting path
- Checks that the Numba thread count is restored after fitting
- Checks MKL rejection and non-MKL acceptance
Added speedtests/
- Compares current numpy and numba solvers
- Supports legacy-package comparisons
- Supports MKL-oriented environment checks
- Saves synthetic fMRI-like benchmark summaries and prediction-difference checks as CSV files

Implementation Notes

FastL2LiR.fit() keeps the existing setup flow: dtype conversion, optional reshaping of Y, feature-selection settings, chunking, and save_select_feat handling.

The Numba solver is only used for the feature-selection path when save_select_feat=False. The following paths continue to use the existing NumPy implementation:

save_select_feat=True
spatial_norm is specified
select_sample is specified
no feature selection is performed

For the feature-selection fitting path, both solvers use the same mathematical objects:

C: correlations between input features and target units
W0 = X.T @ X + alpha * I: Gram matrix including the bias column
W1 = Y.T @ X: right-hand side for each target unit

The NumPy solver loops over target units, selects the top n_feat features by absolute correlation, appends the bias index, and solves a small ridge system with np.linalg.solve.

The Numba solver passes the same C, W0, and W1 to _fit_selected_ridge_numba(). The Numba kernel parallelizes over target units with prange and, for each unit:

Selects the top n_feat features by absolute correlation
Appends the bias index
Builds the selected submatrix from W0
Builds the matching right-hand side from W1
Solves for ridge weights and bias with np.linalg.solve
Writes the result back into the full feature space

Both NumPy and Numba paths now use stable sorting for feature selection. This makes tied-correlation cutoff behavior deterministic across solvers. In non-tied data this should not change results, but in tied cases the selected features may differ from the previous NumPy quicksort behavior.

For dtype=np.float32, the Numba solver performs the per-unit linear solve internally in float64 and casts the result back to float32. Therefore float32 results are expected to agree with NumPy within float32 tolerance, not bit-for-bit.

Design Principles

Preserve existing behavior by default
- solver="numpy" remains the default.
- Numba is an explicit opt-in path.
Treat numerical validity as more important than speed claims
- The Numba solver is not a different regression algorithm.
- It uses the same selected-feature ridge formulation as the NumPy solver.
- Tests compare weights, bias, and predictions across solvers.
Make runtime environment differences explicit
- BLAS backend and thread settings can affect performance and stability.
- The Numba solver rejects Intel MKL environments for now.
- speedtests/ records environment details, runner labels, dimensions, median runtimes, speedups, and prediction differences.
Keep the dependency optional
- Users who do not opt in to the Numba solver do not need Numba or SciPy.

Known Limitations and Review Notes

solver="numba" is experimental and limited to the feature-selection fitting path with save_select_feat=False.
When solver="numba" is requested for an unsupported fit path, fit() now emits a warning and uses the NumPy path.
Intel MKL is rejected conservatively.
- This could potentially be relaxed later if safe operating conditions are clarified.
The Numba kernel uses prange while each unit calls np.linalg.solve.
- On OpenBLAS or BLIS, nested Numba parallelism and BLAS threading may still oversubscribe CPU threads.
- numba_num_threads controls Numba-side parallelism, but it does not directly control BLAS threads inside the compiled kernel.
- Benchmarking across representative environments is still important.
speedtests/ may be useful but could be considered extra for this PR.
- It helps document performance behavior and environment differences.
- It also increases review scope because the benchmark framework is larger than the core implementation change.
- If preferred, it can be split into a separate PR.

Validation

Validated locally with the project .venv:

.venv/bin/python --version
# Python 3.12.12

.venv/bin/python -m pytest -q
# 16 passed, 91 warnings

.venv/bin/python -m ruff check .
# All checks passed!

.venv/bin/python -m ruff format --check .
# 9 files already formatted

Notes:

Running the system pytest used Python 3.8 and failed to import fastl2lir._numba.
This project declares requires-python = ">=3.10", so the .venv result above is the relevant validation result.

…dation

Copilot

Pull request overview

This PR introduces an opt-in solver="numba" implementation for FastL2LiR.fit() to accelerate the feature-selection ridge fitting path, while keeping the existing NumPy behavior as the default. It also adds tests for solver agreement and a set of speed-test scripts to benchmark runtime and record environment details.

Changes:

Added solver / numba_num_threads parameters and a Numba-based selected-feature ridge fitting path (used only when feature selection is active and save_select_feat=False).
Added optional Numba implementation module (src/fastl2lir/_numba.py) with BLAS-backend validation and warm-up compilation.
Added tests for Numba vs NumPy solver agreement plus a speedtests/ benchmark harness and a numba optional dependency group.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`src/fastl2lir/fastl2lir.py`	Adds `solver` / `numba_num_threads`, routes selected-feature fitting to NumPy or Numba, and makes feature selection sorting stable for this path.
`src/fastl2lir/_numba.py`	Implements the Numba parallel kernel + environment validation (MKL rejection) + warm-up.
`tests/test__numba.py`	Adds solver equivalence tests, environment validation tests, and thread-count restoration checks.
`tests/test_fastl2lir.py`	Minor formatting/quoting update in an existing test.
`pyproject.toml`	Adds `project.optional-dependencies.numba` for opt-in installation.
`speedtests/*`	Adds scripts and utilities to benchmark/compare solvers and environments, plus documentation.
`.gitignore` / `speedtests/.gitignore`	Ignores numba cache files and local speedtest outputs/config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+import fastl2lir
+import fastl2lir._numba as fastl2lir_numba
+


+    def test_numba_solver_allows_non_mkl_backend(self):
+        """Test that numba environment validation allows non-MKL BLAS."""
+
+        with (


+            warnings.warn(
+                "solver='numba' is only used for feature-selection fitting "
+                "when save_select_feat is False. Falling back to the numpy "
+                "fit path for this call."
+            )


+            elif solver == "numba":
+                if fit_selected_ridge_numba is None:
+                    raise ImportError("solver='numba' requires numba to be installed")
+                W, b = fit_selected_ridge_numba(
+                    X, C, W0, W1, n_feat, dtype, numba_num_threads
+                )


+                    X, C, W0, W1, n_feat, dtype, numba_num_threads
+                )
+            else:
+                raise ValueError("Unknown solver specified:", solver)


+        solver="numpy",
+        numba_num_threads=4,


fix my mistakes... Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

izpyon · 2026-06-09T09:06:03Z

Supplemental benchmark results

I'm sorry but I forgot to include the benchmark numbers in the initial PR description, so I am adding them here as supplementary local validation.

These results were measured on M5-MacBook-Air.local using synthetic fMRI-like data. The benchmark compares the existing NumPy solver and the new optional Numba solver for feature-selected fitting.

Case	X shape	Y shape	k	q	NumPy median	Numba median	Speedup	Max prediction diff
fc8	(1000, 15000)	(1000, 1000)	500	1000	3.5633 s	1.8044 s	1.97x	0.000e+00
fc6	(1000, 15000)	(1000, 4096)	500	4096	12.5028 s	5.2203 s	2.40x	0.000e+00
conv5_chunk5	(1000, 15000)	(1000, 5, 14, 14)	500	980	6.5162 s	4.8589 s	1.34x	0.000e+00
conv4_chunk5	(1000, 15000)	(1000, 5, 28, 28)	500	3920	15.1831 s	8.2120 s	1.85x	0.000e+00

Across these cases, the Numba solver was approximately 1.34x to 2.40x faster than the NumPy solver, while the maximum prediction difference on the checked samples was 0.000e+00.

These numbers should be interpreted as environment-specific benchmark results rather than a general performance guarantee. The actual speedup may depend on the BLAS backend, thread configuration, problem size, and hardware.

ganow

Thanks for working on this, and for keeping the numba path opt-in. Before going into the algorithm details (which I will review separately), I want to step back to the structure of the change and how it should be sequenced, because it overlaps with another in-flight PR.

Context: there is another open PR, #10, that also adds a solver option, but in a different place and form. #10 adds solver to the constructor (FastL2LiR(solver=...)) with values scipy/numpy, and routes every linear solve through it. This PR adds solver to fit() with values numpy/numba, applied to the feature-selection path only. If both land as they are, the library ends up with two different and incompatible solver parameters in two different places. So before merging this, I want to settle one consistent design.

Direction I would like to take:

A single solver, selected once and injected through the constructor, as #10 does. Not a per-call fit() argument, and not two separate solver knobs.
Solver-specific options should live inside the solver, not on the shared API. Right now numba_num_threads is a parameter on fit(), but it only means anything for the numba solver. If we keep adding solvers this way, fit() (and the class) accumulate per-solver knobs (scipy_*, numba_*, and so on). Instead, each solver should own its own configuration so that its dependencies stay contained within it.
The clean way to get both of those is to make a solver a strategy object: an object that bundles a solver's implementation together with its own options, constructed by the user and passed in once, for example FastL2LiR(solver=NumbaSolver(num_threads=8)), with plain strings like "numba" kept as aliases for default configurations.

Because this depends on #10 and on an abstraction that does not exist yet, I would like to split the work into separate PRs rather than land it all here:

Merge #10 first. That establishes the constructor-level solver option.
A PR that introduces the solver strategy abstraction (the interface, plus the existing numpy/scipy solvers expressed through it).
This numba work rebased on top of that abstraction, as a numba strategy.

This is not a rejection of the numba implementation. It is about giving it a stable foundation to sit on, so that we do not ship a second conflicting solver API and then have to migrate it later.

Separately, a dedicated PR for a benchmarking harness would be welcome. The speedtests/ directory here is a good basis for one, but it is a different concern from the solver change and is large enough to review on its own, so I would prefer it not ride along with this PR.

The inline comments are the more specific, code-level points to carry into that work.

ganow · 2026-06-09T10:03:46Z

+    return W.T, b
+
+
+def warmup_numba(dtype=np.float64):


Recommend removing warmup_numba (and its call inside fit_selected_ridge_numba).

Claim: warmup does not reduce a real fit's wall-clock, because numba's JIT compile cost is paid inside the same fit() call whether or not warmup runs first. The compile cost is independent of input size, so compiling on tiny data first buys nothing the real call does not already pay.

Evidence is reproducible. All measurements below are on this PR branch with numba 0.65.1 (Apple Silicon, numpy on the Accelerate backend); absolute times are machine-specific but the relations hold across backends.

Compile cost is independent of data size, while only execution scales. Save as mwe_compile.py and run each size in a fresh process:

import sys, time import numpy as np from numba import njit @njit # cache=False (default): each process compiles fresh, isolating compile cost def sum_sq(a): s = 0.0 for i in range(a.size): s += a[i] * a[i] return s size = int(sys.argv[1]) a = np.ones(size, dtype=np.float64) t0 = time.perf_counter(); sum_sq(a); t1 = time.perf_counter() # compile + run t2 = time.perf_counter(); sum_sq(a); t3 = time.perf_counter() # run only print(f"size={size:>12,} first(compile+run)={ (t1-t0)*1000:8.1f}ms " f"run-only={(t3-t2)*1000:9.4f}ms compile~={(t1-t0-(t3-t2))*1000:7.1f}ms")

$ for n in 8 1000000 100000000; do python mwe_compile.py $n; done

array size first (compile+run) run only compile approx

8 233.0 ms 0.0012 ms 233.0 ms

1,000,000 138.9 ms 0.48 ms 138.4 ms

100,000,000 255.4 ms 50.7 ms 204.7 ms

Compile stays in the same ~140-250 ms band with no upward trend in size (the 8-element case is even highest, i.e. the variation is compile-time jitter, not data size). Only run-only scales with size.

The current warmup is also counterproductive: it builds C as C-contiguous, but the real path passes C.T (F-contiguous), so numba compiles a second specialization the real call never reuses. To see the two specializations:

import numpy as np import fastl2lir import fastl2lir._numba as N X = np.random.default_rng(0).normal(size=(200, 300)) Y = np.random.default_rng(1).normal(size=(200, 50)) fastl2lir.FastL2LiR().fit(X, Y, alpha=1.0, n_feat=20, solver="numba") for sig in N._fit_selected_ridge_numba.signatures: print(sig) # Prints 2 signatures. The 2nd argument (C) is 'C'-contiguous in one # (from warmup) and 'F'-contiguous in the other (the real C.T). The # warmup specialization is never reused.

Single cold fit, warmup vs no warmup. Reproduce by running a single fit(..., solver="numba") on X=(500,800), Y=(500,100), n_feat=50 in a fresh process with a fresh numba cache (NUMBA_CACHE_DIR=$(mktemp -d)), and compare against the same run with warmup disabled (fastl2lir._numba.warmup_numba = lambda dtype=np.float64: None before the fit):

current warmup: 4.51 s (two specializations compiled)

warmup disabled: 3.50 s (one specialization)

warmup fixed to match the real F-contiguous layout: 3.64 s (one specialization)

So even a bug-free warmup is at best neutral and adds its own small overhead; the current one adds ~1 s of wasted compilation on a cold cache.

A JIT first-call spike is inherent to JIT compilation and is acceptable; it does not warrant a warmup step in the library.

ganow · 2026-06-09T10:03:46Z

        cache_dir="./cache",
        dtype=np.float64,
+        solver="numpy",
+        numba_num_threads=4,


Two concerns with numba_num_threads as a public fit() parameter:

It is solver-specific state on the general fit() signature. As more solvers are added, the signature tends to accumulate per-solver keywords (numba_*, and so on). One way to avoid this is to let the chosen solver carry its own options (for example, passing the solver as an object constructed with its own settings) rather than adding solver-specific keywords to fit().

The default 4 caps parallelism. set_num_threads(min(numba_num_threads, previous)) only ever reduces the thread count, so on a many-core machine the default silently runs at <= 4 threads. For a speedup feature that is surprising. Consider defaulting to None (let numba / NUMBA_NUM_THREADS decide) and treating the prange-vs-BLAS oversubscription concern as a solver-internal default rather than a hardcoded cap.

ganow · 2026-06-09T10:03:46Z

        chunk_size=0,
        cache_dir="./cache",
        dtype=np.float64,
+        solver="numpy",


This PR and #10 both introduce a solver option but in incompatible ways: #10 puts it on __init__ with values scipy/numpy (default scipy) and routes every solve, while this PR puts it on fit() with values numpy/numba (default numpy) for the feature-selection path only. They need to be reconciled.

Suggested design: make solver a strategy object selected once (constructor side, consistent with #10), with string aliases for defaults:

FastL2LiR(solver="numba") # alias for the default config FastL2LiR(solver=NumbaSolver(num_threads=8)) # explicit tuning

To keep the dispatch point consistent across solvers, define the strategy interface at the operation level (e.g. a fit_selected(...) that runs the whole feature-selection loop). numpy and scipy share one loop parameterized by an injected single-system solve (this is where #10's per-system solver lives), and numba implements the same operation with its kernel. numba delegates unsupported paths (save_select_feat, spatial_norm, no selection) to the numpy implementation, which is exactly the fallback this PR already warns about.

Suggested sequencing: merge #10 first, then a PR introducing the strategy interface, then this numba work rebased as a numba strategy on top.

ganow · 2026-06-09T10:03:46Z

+    check_numba_solver_environment(dtype)
+
+
+def fit_selected_ridge_numba(X, C, W0, W1, n_feat, dtype, numba_num_threads=4):


fit_selected_ridge_numba is effectively the entry point used by fastl2lir.py, but it lives in a private module (_numba), and the tests import fastl2lir._numba directly. Consider making the public boundary explicit:

If numba is exposed to users through a public handle (for example a NumbaSolver class in a public solvers module), keep the njit kernel, warmup, and MKL detection private in _numba.py and have that public class wrap them. Tests would then exercise the public API, with only kernel-specific unit tests reaching into _numba.

If numba stays a plain string option (solver="numba") with no public class, rename this function to _fit_selected_ridge_numba to mark it internal.

ganow · 2026-06-09T10:03:47Z

+                with threadpool_limits(limits=1, user_api="blas"):
+                    for index_outputDim in tqdm(range(Y.shape[1])):
+                        C0 = abs(C[index_outputDim, :])
+                        feat_idx = np.argsort(C0, kind="mergesort")


Why is the numpy path switched from np.argsort(C0) (quicksort) to kind="mergesort"?

This changes the existing numpy path, not just the numba path. The previous quicksort is unstable, so when two features tie in absolute correlation at the n_feat cutoff, switching to a stable sort can change which feature is selected, and therefore the resulting W and predictions, relative to previous releases. Non-tied data is unaffected. This also affects current users who never use the numba solver.

Could you explain what motivates the change (for example, deterministic/reproducible tie-breaking, or making the numpy and numba solvers agree in the cross-solver test)? Flagging it because it alters existing default behavior.

ganow · 2026-06-09T10:03:47Z

@@ -0,0 +1,86 @@
+# Speed Tests


Could we split speedtests/ out into its own PR rather than landing it here? It is ~900 lines across several files and inflates this PR's review scope well beyond the ~157-line core numba change, and it is a separate concern (benchmark tooling) from the solver feature itself. Keeping them apart makes both easier to review and gets the solver work merged sooner.

To be clear, this is worth keeping, not dropping. A reproducible benchmark harness that records the BLAS backend, thread settings, and prediction-difference checks is genuinely valuable for a library whose main claim is speed. I would like it back as a dedicated PR. When it returns, a few portability issues are worth fixing so that others can actually run it:

speedrun_legacy.sh runs conda activate base and then executes ${LEGACY_CONDA}/bin/python. This assumes the user has conda installed and that the legacy fastl2lir lives specifically in conda's base environment. Anyone who keeps the legacy package in a named env, or who does not use conda at all, cannot run this. The environment and interpreter should be configurable rather than hardcoded to base.

The legacy and MKL runners default to the local paths .venv-legacy and .venv-mkl (in local.sh.example, speedrun_legacy.sh, and speedrun_mkl.sh). Those directories will not exist on another machine, so those two runners only work for whoever happens to have environments at exactly those paths. That is fine as a personal convenience, but it should be documented as required local setup rather than presented as a runner that works out of the box.

speedrun_mkl.sh defaults MKL_BENCH_RUNNERS="numpy,numba", so it tries to run the numba solver. But the numba solver raises RuntimeError when it detects an MKL BLAS backend, which is exactly the environment this script targets. As written, the MKL speed run would crash on the numba runner. Either the MKL run should exclude numba, or the MKL rejection policy needs revisiting.

A clear home such as a benchmarks/ directory would also help.

For this PR, it would be enough to run the benchmark on these changes and share a simple per-environment speed comparison (a small table of numpy vs numba median times / speedups across the environments you tested) in the PR description or a comment.

izpyon added 5 commits June 9, 2026 17:07

Add optional numba solver

5641ac3

docs: update numba solver documentation to clarify dtype behavior

2b69f81

fix: use mergesort for stable feature index sorting in numba solver

197cfa3

fix: validate solver input and improve numba feature selection handling

490d86f

test: add comprehensive tests for numba solver functionality and vali…

d9912bc

…dation

izpyon marked this pull request as ready for review June 9, 2026 08:46

KenyaOtsuka requested review from KenyaOtsuka and Copilot June 9, 2026 08:48

Copilot started reviewing on behalf of KenyaOtsuka June 9, 2026 08:48 View session

KenyaOtsuka requested a review from ganow June 9, 2026 08:48

Copilot AI reviewed Jun 9, 2026

View reviewed changes

Potential fix for pull request finding

b776b98

fix my mistakes... Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

ganow requested changes Jun 9, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add an optional Numba solver for feature-selected FastL2LiR fitting#15

Add an optional Numba solver for feature-selected FastL2LiR fitting#15
izpyon wants to merge 6 commits into
KamitaniLab:devfrom
izpyon:add/numba-solver

izpyon commented Jun 9, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

izpyon commented Jun 9, 2026

Uh oh!

ganow left a comment

Uh oh!

ganow Jun 9, 2026

Uh oh!

ganow Jun 9, 2026

Uh oh!

ganow Jun 9, 2026

Uh oh!

ganow Jun 9, 2026

Uh oh!

ganow Jun 9, 2026

Uh oh!

ganow Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

array size	first (compile+run)	run only	compile approx
8	233.0 ms	0.0012 ms	233.0 ms
1,000,000	138.9 ms	0.48 ms	138.4 ms
100,000,000	255.4 ms	50.7 ms	204.7 ms

		check_numba_solver_environment(dtype)


		def fit_selected_ridge_numba(X, C, W0, W1, n_feat, dtype, numba_num_threads=4):

Conversation

izpyon commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Main Changes

Implementation Notes

Design Principles

Known Limitations and Review Notes

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

izpyon commented Jun 9, 2026

Supplemental benchmark results

Uh oh!

ganow left a comment

Choose a reason for hiding this comment

Uh oh!

ganow Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

ganow Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

ganow Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

ganow Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

ganow Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

ganow Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

izpyon commented Jun 9, 2026 •

edited

Loading