Open up `bdpy.mri.fmriprep` tests: surface GM, split_task_label, real_data marker, test_metrics FP fix by kencan7749 · Pull Request #116 · KamitaniLab/bdpy

kencan7749 · 2026-05-11T05:13:40Z

I am continuing the previous contributor's work to open up the test code for bdpy.mri.fmriprep.
This PR builds directly on the previous test-workflow PR, so when read against dev, the combined diff contains the full test workflow plus the additions listed below.

What the previous contributor did (details in the previous PR)

Set up the mock and real-data test scaffolding under tests/mri/fmriprep/:
- test_fmriprep_mock.py (mock dataset, ~34 tests including a create_bdata_fmriprep golden-master comparison)
- test_fmriprep_real.py (real-data golden-master comparison against OpenNeuro ds006319)
- test_fmriprep_utils_mock.py (MockBidsBuilder and expected-data helpers)
- test_fmriprep_utils.py (shared config and RealDatasetMixin)
Documented operation and coverage in README.md and TEST_COVERAGE.md.
Mock-test helper scripts (scripts/mock/step_1_prepare_gm.sh, step_2_run_test.sh).
Real-data helper scripts (scripts/real/step_1_download.sh through step_5_run_test.sh) wiring FreeSurfer, Docker, and fMRIPrep 1.2.1, with _fs_env.sh.example for machine-specific overrides.
.gitignore rules that track only .h5 golden masters and the relevant scripts.

What I did in this PR

Fixed tests/evals/test_metrics.py failure on the base env (Python 3.8). Replaced 5 self.assertTrue(np.array_equal(...)) in test_2d and test_2d_nan with np.testing.assert_allclose(rtol=1e-12, atol=0). The pickled expected values drift by ~1 ULP (1.1e-16–2.2e-16) from the current numpy/BLAS output; np.allclose already passes, so this is FP-rounding version drift rather than an algorithm change. The implementation (bdpy/evals/metrics.py) and the pickle fixtures are untouched.
Added value-level golden-master coverage for surface_native. New test TestCreateBdataFmriprepMock.test_create_bdata_fmriprep_surface_native_gm with with_confounds=True and exclude={"session/run": [[1, 2], None]}, comparing VertexData, VertexLeft/VertexRight, vertex_index, motion, confounds, and label submeta. The previous surface tests were shape-only, so regressions in BrainData.__load_surface could pass undetected. New fixture: tests/data/mri/golden_master/mock/test_output_fmriprep_subject_surface.h5 (~2 MB, regeneratable via TEST_FMRIPREP_CREATE_GOLDEN_MASTER=1).
Added mock coverage for the split_task_label=True branch. New test test_create_bdata_fmriprep_split_task_label_single_task exercises the branch on the single-task mock dataset, asserting data_labels == ["sub-0840_task-mock"] and that the per-task BData matches the existing non-split GM. Previously this branch was covered only by the real-data test.
Extended build_expected_bdata_after_exclude() in tests/mri/fmriprep/test_fmriprep_utils_mock.py with a backward-compatible data_mode parameter. For surface modes it loads GIFTI via nibabel and hstacks left/right hemispheres to mirror BrainData.__load_surface.
Introduced a real_data pytest marker. Declared in pyproject.toml under [tool.pytest.ini_options] and applied to TestCreateBdataFmriprepReal. CI can now use pytest -m "not real_data" for an explicit exclusion; the dynamic unittest.SkipTest fallback in RealDatasetMixin.setUpClass is preserved.
Refreshed documentation and scripts.
- tests/mri/fmriprep/README.md: added the new fixture and documented pytest -m "not real_data" in the Real-Data Tests section.
- tests/mri/fmriprep/TEST_COVERAGE.md: updated the "Last updated" date and reclassified the surface_native and split_task_label gaps as covered.
- scripts/mock/step_1_prepare_gm.sh: tracks the new surface fixture.
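
The tolerance change in tests/evals/test_metrics.py can be illustrated with a minimal sketch. This is not the actual test: the array values are made up, and `np.nextafter` is used here only to simulate a drift of one ULP like the one described above.

```python
import numpy as np

# Made-up "pickled" expected values (not the real fixtures).
expected = np.array([[1.0, 0.5], [0.25, 2.0]])

# Simulate version drift: move every value to the next representable float64.
actual = np.nextafter(expected, np.inf)

# Exact comparison fails although each value differs by only one ULP.
exact_match = np.array_equal(actual, expected)

# Largest relative difference between the two arrays.
max_rel_diff = np.max(np.abs(actual - expected) / np.abs(expected))

# Relative-tolerance comparison with the settings used in the fix passes.
np.testing.assert_allclose(actual, expected, rtol=1e-12, atol=0)
```

With rtol=1e-12 the check stays several orders of magnitude tighter than a typical algorithmic change would be, so a real regression should still be caught.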

Verification:

pytest tests/evals/test_metrics.py -v → 4 passed (Python 3.8 base env)
pytest tests/mri/fmriprep/ -m "not real_data" -v → 36 passed, 1 deselected
pytest tests/ -m "not slow and not real_data" -q → 245 passed, 1 deselected, 0 failed
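
For readers unfamiliar with the dynamic skip mentioned above, here is a minimal sketch of a setUpClass-based SkipTest fallback using only the standard library. The class names and the dataset path are hypothetical and are not the ones in this PR; in the PR the real test class additionally carries the real_data pytest marker.

```python
import os
import tempfile
import unittest


class RealDatasetMixinSketch:
    """Hypothetical stand-in for a mixin that skips when the dataset is absent."""

    # Assumed location; this directory is expected not to exist.
    dataset_dir = os.path.join(tempfile.gettempdir(), "hypothetical_real_data_missing")

    @classmethod
    def setUpClass(cls):
        if not os.path.isdir(cls.dataset_dir):
            # Skips every test in the class instead of failing.
            raise unittest.SkipTest("real dataset not found: %s" % cls.dataset_dir)
        super().setUpClass()


class TestRealSketch(RealDatasetMixinSketch, unittest.TestCase):
    def test_dataset_is_directory(self):
        self.assertTrue(os.path.isdir(self.dataset_dir))


def run_sketch():
    """Run the sketch test class and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestRealSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The marker gives an explicit, static way to deselect the tests, while a fallback of this kind keeps a plain test run from failing on machines that do not have the dataset.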

Known bugs and weak points, acknowledged but not yet fixed

Implementation-side bugs in bdpy/mri/fmriprep.py (deferred to a separate PR):

BrainData.__dtype is 'surface' / is not 'surface' identity comparisons (L388, L398). Works today by Python short-string interning, but is semantically wrong; will surface as warnings or break on future Python versions. Should be == / !=.
del fmriprep.data[sub] / del fmriprep.data[sub][ses] during iteration of the OrderedDict in create_bdata_fmriprep (L270–L279). Can raise RuntimeError: dictionary changed size during iteration. Masked today because the mock dataset has a single subject and del happens at the end of iteration (see the multi-subject gap below — they hide each other).

Test coverage gaps:

Multi-task split_task_label=True (multiple elements in bdata_list) is covered only by the real-data test. Mock-side coverage requires extending MockBidsBuilder to emit multiple task-* labels.
Multi-subject GM is absent. MockBidsBuilder emits a single subject, so subject-iteration accumulation (last_run, last_block) and partial-subject exclude are not value-tested. This is what masks the OrderedDict-mutation bug above.
surface_standard, surface_standard_41k, surface_standard_10k are still shape-only; value-level GM would catch path-specific regressions in __load_surface for the fsaverage* family.
return_list=False unwrap path (bdpy/mri/fmriprep.py L352–L360) has no explicit assertion; all tests use return_list=True.
csv label_mapper loading, LabelMapper.dump(), and the non-unique-value RuntimeError('Invalid label-value mapping') path are not directly asserted.
fMRIPrep version 1.0 / 1.1 file-pattern branches are not exercised (MockBidsBuilder targets 1.2).

micchu · 2026-05-11T07:20:30Z

@kencan7749, @izpyon

Thank you both for your work on the fMRIPrep test workflow. I understand that #116 by @kencan7749 builds on the implementation introduced in #115 by @izpyon and adds further updates.

Given this relationship, I think it may be practical to consolidate the review here, provided that the contribution from #115 remains properly credited and both authors are aligned with that direction.

@kencan7749, would you be willing to take over the follow-up changes and discussion points originally raised for #115 in this PR?

Regarding the real-data workflow from #115

One point I would like to revisit is the real-data workflow originally introduced in #115.

The current workflow downloads raw BIDS data from OpenNeuro via DataLad and then runs FreeSurfer / fMRIPrep locally. While this is reproducible, it may be too heavy for a test fixture workflow, and DataLad is not currently part of our standard lab workflow.

Could you consider revising the real-data fixture workflow so that a small fMRIPrep-processed test fixture is hosted externally, for example on Figshare, and downloaded explicitly only when running the real_data tests?

In that case, the repository would contain the test code and download / verification scripts, but not the real-data-derived binary files themselves. It would also be helpful to document the fixture version, checksum, source OpenNeuro dataset, fMRIPrep / FreeSurfer versions, Docker image tag, generation command, and expected directory structure in the README.

Regarding the `real_data` marker

Thank you for introducing the real_data pytest marker. I think this is a very reasonable implementation.

As mentioned above, real-data test fixtures should preferably be obtained from an external data repository. Therefore, operations such as external network access or downloading large files should not be part of the default test suite. Separating these tests with the real_data marker seems appropriate.

Regarding `surface_native` golden-master coverage

Thank you for also adding coverage for the surface_native path.

I was initially concerned about a possible FreeSurfer version dependency in the value-level surface data check. However, as far as I understand, this test uses artificial GIFTI functional data generated by the mock builder, rather than FreeSurfer-derived surface geometry. Therefore, I think this is reasonable as a unit / regression test.
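
As a rough illustration of what such a value-level check exercises, the left/right stacking can be sketched with plain NumPy arrays standing in for the mock GIFTI data. The function name and the exact layout of the hemisphere flags and vertex indices are assumptions made for this sketch, not taken from bdpy.

```python
import numpy as np


def stack_hemispheres(left, right):
    """Stack (samples x vertices) arrays from the left and right hemispheres.

    Returns the stacked data plus per-column metadata rows (assumed layout).
    """
    data = np.hstack([left, right])
    n_left, n_right = left.shape[1], right.shape[1]
    # 1 for columns that came from the left hemisphere, 0 otherwise.
    vertex_left = np.concatenate([np.ones(n_left), np.zeros(n_right)])
    vertex_right = 1 - vertex_left
    # Vertex index restarts at 0 within each hemisphere.
    vertex_index = np.concatenate([np.arange(n_left), np.arange(n_right)])
    return data, vertex_left, vertex_right, vertex_index


# Artificial data: 4 samples, 3 left vertices, 5 right vertices.
rng = np.random.default_rng(0)
left = rng.standard_normal((4, 3))
right = rng.standard_normal((4, 5))
data, vertex_left, vertex_right, vertex_index = stack_hemispheres(left, right)
```

Because the inputs are synthetic, a comparison of this kind depends only on the stacking logic and not on any FreeSurfer output.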

On the other hand, if you add golden-master coverage for real fMRIPrep surface outputs in the future, the values and vertex correspondence may depend on the FreeSurfer / fMRIPrep versions and resampling settings. In that case, the relevant versions should be pinned and clearly documented in the README.

Regarding known bugs and weak points

Thank you for summarizing the known weak points.

For the string identity comparisons in `BrainData`,

I agree that is / is not should be replaced with == / !=. This seems like a small fix, so if it is not addressed in this PR, it would be good to track it as a follow-up issue or a small separate PR.
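
A minimal sketch of why the identity comparison is fragile. The strings and the helper name are made up for illustration; only the comparison operators matter. Whether two equal strings are the same object is an implementation detail of the interpreter, so the sketch records the result of `is` without relying on it.

```python
def is_surface(dtype):
    """Compare by value, which is what the check is meant to express."""
    return dtype == "surface"


literal = "surface"
# An equal string built at runtime; it need not be the same object as the literal.
runtime = "".join(["sur", "face"])

same_value = (runtime == literal)   # value equality
same_object = (runtime is literal)  # identity: depends on the interpreter
```

Using `==` makes the result independent of how the string was created, for example when it was read from a file or built by formatting.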

For the `OrderedDict` mutation during iteration in `create_bdata_fmriprep()`,

could you please open an issue for this? As you pointed out, the exclude logic currently deletes entries using del fmriprep.data[sub] and del fmriprep.data[sub][ses] while iterating over the same OrderedDict. This can raise RuntimeError: OrderedDict mutated during iteration in multi-subject or multi-session cases. It looks like the current mock dataset has only a single subject, and the deletion happens at the end of the iteration, so the issue is currently masked. However, we are considering making the lab shared-directory structure more BIDS-like in the future. If that happens, this part could become a serious issue. In the long term, it would be safer to avoid deleting entries during iteration and instead rebuild the remaining subject / session / run structure using a filtering-based implementation.
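
The difference between the two approaches can be sketched as follows. The data layout, function names, and exclude arguments are hypothetical and much simpler than the real create_bdata_fmriprep logic; the sketch only contrasts deleting during iteration with rebuilding by filtering.

```python
from collections import OrderedDict


def make_data():
    """Made-up subject -> session -> runs layout (not the real fmriprep.data)."""
    return OrderedDict([
        ("sub-01", OrderedDict([("ses-01", ["run-01", "run-02"]),
                                ("ses-02", ["run-01"])])),
        ("sub-02", OrderedDict([("ses-01", ["run-01"])])),
    ])


def delete_while_iterating(data, excluded_subjects):
    """Anti-pattern: delete entries from the dict that is being iterated."""
    for sub in data:
        if sub in excluded_subjects:
            del data[sub]


def filter_data(data, excluded_subjects=(), excluded_sessions=()):
    """Build a new structure that contains only the entries to keep."""
    kept = OrderedDict()
    for sub, sessions in data.items():
        if sub in excluded_subjects:
            continue
        kept_sessions = OrderedDict(
            (ses, runs) for ses, runs in sessions.items()
            if (sub, ses) not in excluded_sessions
        )
        # Assumption of this sketch: subjects left without sessions are dropped.
        if kept_sessions:
            kept[sub] = kept_sessions
    return kept


# Deleting a non-final key during iteration is expected to raise RuntimeError.
try:
    delete_while_iterating(make_data(), {"sub-01"})
    mutation_raised = False
except RuntimeError:
    mutation_raised = True

filtered = filter_data(make_data(), excluded_sessions={("sub-01", "ses-02")})
```

The filtering version never modifies its input, so it behaves the same regardless of where the excluded entry sits in the iteration order.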

…l_data marker

Phase 1 of bdpy.mri.fmriprep test code opening:

- Extend build_expected_bdata_after_exclude() in test_fmriprep_utils_mock.py with a data_mode parameter supporting all four surface modes (native and three standard variants). For surface modes, the helper now loads GIFTI files via nibabel and hstacks L/R hemispheres to mirror the production BrainData.__load_surface path, and emits VertexData / VertexLeft / VertexRight / vertex_index metadata in place of voxel coordinates.
- Add TestCreateBdataFmriprepMock.test_create_bdata_fmriprep_surface_native_gm: value-level golden-master comparison for surface_native with with_confounds=True and exclude={"session/run": [[1, 2], None]}, mirroring the existing volume_native GM coverage. New fixture: tests/data/mri/golden_master/mock/test_output_fmriprep_subject_surface.h5
- Add TestCreateBdataFmriprepMock.test_create_bdata_fmriprep_split_task_label_single_task: exercises split_task_label=True. The mock dataset has a single task ("task-mock"), so this verifies the data_labels suffix format and that the bdata content matches the non-split GM. Multi-task coverage is provided by test_fmriprep_real.py.
- Introduce a "real_data" pytest marker (declared in pyproject.toml and applied to TestCreateBdataFmriprepReal). CI can now run `pytest -m "not real_data"` to exclude the real dataset test cleanly.
- Update tests/mri/fmriprep/scripts/mock/step_1_prepare_gm.sh to track the new surface GM file, and refresh README.md / TEST_COVERAGE.md.

Verification: `pytest tests/ -m "not slow and not real_data" --tb=line -q` => 245 passed, 1 deselected, 0 failed (Phase 0 baseline was 243).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

kencan7749 · 2026-05-11T08:27:47Z

Thank you for the suggestion.
Based on @ganow's suggestion, I have split the test_metrics floating-point comparison fix into a separate PR (#117) and updated this branch accordingly.
This PR now keeps only the surface_native golden master, split_task_label mock test, and real_data marker changes.

Based on the suggestion by @micchu

I agree with the proposed direction. I will continue the follow-up work based on this PR, while making sure that the contribution from #115 (Add tests/mri fmriprep for fmriprep version 1.2.1) remains properly credited.
I also understand the points regarding the real-data fixture workflow, the real_data marker, the surface_native golden-master coverage, and the known weak points. I will consider these points and address them in the follow-up changes or separate issues/PRs as appropriate.

kencan7749 · 2026-05-22T04:35:37Z

@micchu @izpyon

Regarding the real-data test workflow, I'd like to share what I noticed during my investigation and get your opinion.

My understanding is that the bdpy.mri.fmriprep module is used for creating bdata after fMRIPrep processing. In @izpyon's previous implementation, the flow was to download the raw data and then process it, but creating the intermediate files took a long time. So, based on @micchu's earlier comment, I'm planning to revise this by uploading a small set of fMRIPrep-preprocessed files to Figshare and downloading them explicitly when running the real_data tests.

How should we prepare these fMRIPrep-preprocessed files? Are there candidate datasets, sessions, or runs that you have in mind? If data along the lines of the OpenNeuro datasets that @izpyon used previously would be best suited for this test, it would be a great help if @izpyon could upload the preprocessed files.

In addition, I've confirmed that this module cannot be tested with fMRIPrep-derived files alone. Beyond the fMRIPrep-derived files, the following two files are also required:

*_events.tsv (file related to stimulus information)
*_bold.json (file used to extract TR information)

Both are standard files in the raw BIDS data, so we need to make sure they are included when uploading the data.

Note that *_events.tsv is essential because fMRIPrep does not touch it. However, TR information should also be available in the fMRIPrep-processed files. In fact, the current code uses *_bold.json only for:

bdpy/bdpy/mri/fmriprep.py

Lines 188 to 193 in 9d0b724

    
    bold_json_file_name_glob = '%s_%s_task-*_%s_bold.json' % (sbj, ses, run_label)
    bold_json_file_list = glob.glob(os.path.join(raw_func_dir, bold_json_file_name_glob))
    if len(bold_json_file_list) != 1:
        raise RuntimeError('Something is wrong on bold parameter json files.')
    bold_json_file = bold_json_file_list[0].replace(os.path.normpath(self.__datapath) + '/', '')
    run.update({'bold_json': bold_json_file})

and

bdpy/bdpy/mri/fmriprep.py

Lines 610 to 612 in 9d0b724

    
    with open(os.path.join(data_path, run['bold_json']), 'r') as f:
        bold_metainfo = json.load(f)
    tr = bold_metainfo['RepetitionTime']

so I think *_bold.json might become unnecessary in the future.
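
To make the dependency concrete, the two snippets above amount to the lookup sketched below. This is a self-contained illustration, not the bdpy code: the function names, the subject/session/run labels, and the TR value are made up.

```python
import glob
import json
import os
import tempfile


def find_single_bold_json(raw_func_dir, sbj, ses, run_label):
    """Return the single *_bold.json matching the run, or raise if ambiguous."""
    pattern = '%s_%s_task-*_%s_bold.json' % (sbj, ses, run_label)
    matches = glob.glob(os.path.join(raw_func_dir, pattern))
    if len(matches) != 1:
        raise RuntimeError('Expected exactly one file for %s, found %d'
                           % (pattern, len(matches)))
    return matches[0]


def load_tr(bold_json_path):
    """Read the repetition time (seconds) from a BIDS bold sidecar."""
    with open(bold_json_path, 'r') as f:
        return json.load(f)['RepetitionTime']


# Build a throwaway raw "func" directory with one made-up sidecar file.
with tempfile.TemporaryDirectory() as func_dir:
    sidecar = os.path.join(func_dir, 'sub-01_ses-01_task-mock_run-01_bold.json')
    with open(sidecar, 'w') as f:
        json.dump({'RepetitionTime': 2.0}, f)
    tr = load_tr(find_single_bold_json(func_dir, 'sub-01', 'ses-01', 'run-01'))
```

This shows that the sidecar is needed only for the RepetitionTime field, which is why a fixture without raw BIDS files would currently fail at this step.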

I'd appreciate your thoughts on (1) how to prepare these fMRIPrep-preprocessed files, and (2) how we should handle *_bold.json going forward.

kencan7749 · 2026-05-22T08:44:34Z

@micchu @izpyon,
This is a follow-up about preparing a minimal dataset.
Related to this issue, I think the minimal data should contain multiple subjects and multiple sessions/runs so that the exclude flag can be tested.
What do you think?

izpyon · 2026-06-01T04:23:21Z

Dear @kencan7749 -san and @micchu -san,

Sorry for my late reply. I also apologize for not having been able to complete the task properly at the time.

As far as I remember, @micchu -san initially told me that, although the module would ultimately be used with data in the lab environment, we should use an open dataset for real-data testing in light of the openness of the test code. I also remember being advised to use a relatively recent dataset if possible. Based on these conditions, I selected the SoundRecon dataset.

If it would be helpful for me to upload the preprocessed data I have, I would be happy to do so.

Regarding the minimal dataset, I think using multiple sessions/runs makes sense, especially if we want to test the exclude flag properly.

One minor concern is that, at least in the actual lab environment I have seen, I do not often encounter a BIDS directory containing multiple subjects. Therefore, if maintaining a multi-subject minimal dataset would introduce additional cost or complexity, it might be worth considering whether multi-subject support is strictly necessary for this particular real-data test.

That said, multiple sessions/runs do seem useful for testing the expected behavior of the exclude option. I would also like to hear @micchu -san’s opinion on whether the minimal dataset should cover multiple subjects as well, or whether multiple sessions/runs within a single subject would be sufficient.

Thank you.

micchu · 2026-06-04T05:29:37Z

@kencan7749,

Sorry for the delayed reply. Some of this overlaps with the discussion we had in the regular meeting, but I will also leave my comments here.

Regarding where to host the data, I think Figshare would be more appropriate than OpenNeuro. OpenNeuro has a stronger implication that the data are being shared or published as neural activity data, whereas Figshare can be used for a broader range of datasets. For this test dataset, I think Figshare would be easier to use. In particular, if we package all the necessary files into a zip file, downloading the data should also be simpler with Figshare.
1. Given the nature of the data, I think we should also consider item ownership and who should manage the uploaded Figshare item. @kencan7749, could you please take a look at the “Data Publication on Figshare” page in Notion?
2. If we host the data on Figshare, I think the README should clearly document the original OpenNeuro dataset and the fMRIPrep / FreeSurfer versions. Since the output directory structure can differ depending on the fMRIPrep version, the zip filename should also make this identifiable.
For now, I think *_events.tsv and *_bold.json should also be included in the data uploaded to Figshare. As you mentioned, whether we should continue using *_bold.json as the source of TR information is worth considering in the future. However, at this point, I think it is better to follow the existing workflow.
I think it is a very good idea to take exclude into account. By including at least multiple sessions/runs, we should be able to test the behavior of exclude in a way that is closer to real data. Regarding whether to include multiple subjects as well, I think we should eventually support that. In future lab data structures, and also in external public datasets, multiple subjects may be listed. However, this was not part of the original scope, so for this issue, I think it would be better to use a single subject with multiple sessions/runs, and handle multi-subject coverage in the next follow-up issue.
I would like to confirm one point about the real_data marker. Currently, pyproject.toml registers the marker and adds a description, but I think this only registers the real_data marker with pytest; it does not automatically exclude those tests from the default pytest run. The description says “Excluded from CI”, but as pytest behavior, if we run pytest or pytest tests/, tests marked with real_data will still be collected. If we want to exclude real-data tests by default, I think we need either to add addopts = "-m 'not real_data'" under [tool.pytest.ini_options], or to explicitly specify pytest -m "not real_data" in the CI pytest command.
Could you revise the workflow so that it depends as little as possible on shell scripts? With the current direction, the normal real-data test workflow should not include the fMRIPrep execution phase. It should only download the data from Figshare, extract it, and verify the checksum.
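
A script-free version of the download, checksum, and extract steps could look roughly like the sketch below, using only the Python standard library. The function names are hypothetical, and the download URL and expected checksum are deliberately left as parameters because the Figshare item does not exist yet.

```python
import hashlib
import urllib.request
import zipfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download(url, dest):
    """Download url to dest; the url is supplied by the caller (placeholder)."""
    urllib.request.urlretrieve(url, str(dest))
    return Path(dest)


def verify_and_extract(zip_path, expected_sha256, dest_dir):
    """Extract the fixture zip only if its checksum matches the documented one."""
    actual = sha256_of(zip_path)
    if actual != expected_sha256:
        raise ValueError("checksum mismatch for %s: expected %s, got %s"
                         % (zip_path, expected_sha256, actual))
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir
```

Checking the digest before extraction means a truncated or outdated download fails early with a clear message, rather than as a confusing golden-master mismatch later in the test.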

@izpyon,

Yes, I was the one who asked you to use open data as much as possible. At that time, I had not fully considered that the workflow would require running fMRIPrep. I am very sorry about that.

If the fMRIPrep-processed data you created can be used, I think it would be good to use it as the source data for this test fixture. If the working directory still remains on the server, could you please let us know its path? If it has already been deleted, or if sharing it would be difficult, we can prepare the data separately, so please do not hesitate to let us know. Also, if there is any issue with my understanding or assumptions, please feel free to point it out.

izpyon · 2026-06-04T08:07:57Z

Dear @micchu -san, @kencan7749 -san,

I sincerely apologize for the inconvenience caused by not being able to complete this work on schedule.

I have confirmed that the data generated when I ran the workflow are still available on the HDD server, although access may be slow. I will share the server path together with the relevant fs env path and other related settings via DM.

Thank you very much.

izpyon added 3 commits April 7, 2026 10:04

Add tests for mri/fmriprep.py

d301ad7

Add test data for mock test

ead3e28

Add test data for mock test again

5e77c1e

kencan7749 requested review from izpyon and micchu May 11, 2026 05:13

kencan7749 self-assigned this May 11, 2026

micchu mentioned this pull request May 11, 2026

Add tests/mri fmriprep for fmriprep version 1.2.1 #115

Closed

kencan7749 force-pushed the ks_update_fmriprep branch 2 times, most recently from a72b8d9 to b0cd1b7 May 11, 2026 08:06

kencan7749 force-pushed the ks_update_fmriprep branch from b0cd1b7 to 387a009 May 11, 2026 08:15

ganow added the enhancement label May 13, 2026

This was referenced May 22, 2026

Fix string identity comparison in BrainData #124

Merged

`create_bdata_fmriprep` in `bdpy.mri.fmriprep` mutates `OrderedDict` while iterating over it #125

Open

kencan7749 wants to merge 4 commits into KamitaniLab:dev from kencan7749:ks_update_fmriprep
