feat: parallelize viscosity calculation #388
Force-pushed from 872e464 to a72d00f.
@Atilaac one question: in the current "master" workflow, we run the viscosity after the melt quench.
Since we are computing the liquid viscosity, you can run them in parallel. Just start from the same random structure and cool it at the same cooling rate to the temperature of interest, so you can claim that this viscosity comes from the same trajectory as the glass.
Squashed: parallelize viscosity, import cleanup, workflow special-case, test fixes.
Force-pushed from bf2e6bf to aca1283.
Viscosity tasks now start from the freshly generated random structure and depend only on structure_generation, so they fan out in parallel with the main melt-quench instead of waiting for it. Each task still performs its own melt-quench cooling to its target temperature.
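A minimal sketch of the dependency shape described in this commit, for readers skimming the thread. It uses the standard library's `ThreadPoolExecutor` rather than the project's actual executor, and `melt_quench`, `viscosity`, and `run_workflow` are hypothetical stand-ins, not the functions in this PR.

```python
from concurrent.futures import ThreadPoolExecutor


def structure_generation(seed):
    # Stand-in for the step that builds the random starting structure.
    return {"seed": seed, "state": "random"}


def melt_quench(structure, temperature_low):
    # Hypothetical main branch: cool the structure down to the glass.
    return {"parent_seed": structure["seed"], "kind": "glass",
            "temperature": temperature_low}


def viscosity(structure, temperature):
    # Hypothetical viscosity branch: each task does its own cooling
    # to its target temperature, starting from the same structure.
    return {"parent_seed": structure["seed"], "kind": "viscosity",
            "temperature": temperature}


def run_workflow(seed, temperature_low, viscosity_temperatures):
    with ThreadPoolExecutor() as pool:
        # The only shared dependency is the generated structure.
        structure = pool.submit(structure_generation, seed).result()
        # Melt-quench and all viscosity tasks are submitted together,
        # so none of them waits for another.
        glass_future = pool.submit(melt_quench, structure, temperature_low)
        visc_futures = [pool.submit(viscosity, structure, t)
                        for t in viscosity_temperatures]
        return glass_future.result(), [f.result() for f in visc_futures]
```

Because every branch receives the same `structure`, the results can be traced back to one starting configuration, which is the point raised earlier in the thread.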
Force-pushed from a895385 to 6657c2f.
@Atilaac / @Gitdowski I think I'm making a mistake in the LAMMPS settings for the viscosity run here. Can you spot it?
The relevant code should be in
I will have a look.
I think I managed to find the problem and a fix. Now, should I push it here or in a separate PR?
Thanks a lot, feel free to push directly here.
Just checking whether you forgot to push or whether you found something else to check.
Previously a failed pipeline blanket-marked every step 'failed' and stored the error under a generic 'pipeline' key. Now the failed branch probes the per-step executorlib caches to attribute the failure to the step that actually raised, falling back to the old behaviour when it cannot be localised.
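The attribution logic in this commit could look roughly like the sketch below. It is an illustration only: the real code probes executorlib cache files, whereas here `cache` is a plain dict, and the `"completed"` and `"unknown"` status labels are assumptions that do not appear in the commit message.

```python
def attribute_failure(steps, cache, pipeline_error):
    """Attribute a pipeline failure to the step(s) whose cache holds an error.

    `cache` maps a step name to ("ok", None) or ("error", message).
    A step missing from `cache` left no usable cache entry.
    """
    failed = {
        name: cache[name][1]
        for name in steps
        if name in cache and cache[name][0] == "error"
    }
    if not failed:
        # Failure cannot be localised: fall back to the old behaviour and
        # mark every step failed under the generic 'pipeline' key.
        return {name: "failed" for name in steps}, {"pipeline": pipeline_error}

    statuses = {}
    for name in steps:
        if name in failed:
            statuses[name] = "failed"
        elif name in cache and cache[name][0] == "ok":
            statuses[name] = "completed"  # assumed label
        else:
            statuses[name] = "unknown"    # assumed label
    return statuses, failed
```

The fallback branch keeps the previous output shape, so callers that only look for the `'pipeline'` key would still work when no per-step error is found.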
I did not forget to push; I'm still working on it.
fix: stop dangling SHIK melt block from corrupting downstream MD stages
_viscosity_simulation, elastic_simulation, and md_simulation each stripped the melt pre-equilibration block by matching a substring of the original block ("fix langevin ... 5000 5000"). The fixes had since been renamed to langevinnve/ensemblenve and the temperature changed to 4000 K, but these three strip sites were never updated, so a dangling Langevin thermostat + nve/limit integrator survived into every NPT/NVT stage, fighting the real ensemble and exploding the simulation.
Consolidate the melt block into a single source of truth
(lammps/potentials/_melt_block.py: melt_block_lines, strip_melt_block, set_melt_block_temperature), used by all six potential generators, the six melt-quench protocols, and the three downstream MD entry points.
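To illustrate why a single source of truth avoids the stale-pattern bug, here is a rough sketch of what such a helper module might contain. Only the three function names and the two fix IDs come from the commit message; the bodies, the damping value, seed, `nve/limit` distance, and `run` length are placeholder assumptions and not the contents of `_melt_block.py`.

```python
# Fix IDs named in the commit message; stripping keys on these IDs,
# not on the temperature, so retuning the block cannot break it.
MELT_FIX_IDS = ("langevinnve", "ensemblenve")


def melt_block_lines(temperature=4000.0):
    # Placeholder numbers (damp 0.01, seed 48279, limit 0.5, 10000 steps).
    return [
        f"fix langevinnve all langevin {temperature} {temperature} 0.01 48279",
        "fix ensemblenve all nve/limit 0.5",
        "run 10000",
        "unfix langevinnve",
        "unfix ensemblenve",
    ]


def _touches_melt_fix(line):
    return any(token in MELT_FIX_IDS for token in line.split())


def strip_melt_block(lines):
    # Remove everything from the first to the last line mentioning a melt fix.
    hits = [i for i, line in enumerate(lines) if _touches_melt_fix(line)]
    if not hits:
        return list(lines)
    return list(lines[:hits[0]]) + list(lines[hits[-1] + 1:])


def set_melt_block_temperature(lines, temperature):
    # Rewrite Tstart and Tstop on the Langevin fix line only.
    out = []
    for line in lines:
        tokens = line.split()
        if (len(tokens) >= 6 and tokens[0] == "fix"
                and tokens[1] == "langevinnve" and tokens[3] == "langevin"):
            tokens[4] = tokens[5] = str(temperature)
            line = " ".join(tokens)
        out.append(line)
    return out
```

With the generator and the stripper sharing the same fix IDs, renaming a fix or changing the temperature happens in one place and every downstream strip site follows automatically.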
Also:
- melt_quench_simulation now retunes the block to the caller's temperature_high instead of the generator's hardcoded default.
- SHIK viscosity equilibration runs at 0.1 GPa (matching the melt-quench protocol convention) instead of 0 GPa.
- melt_quench_simulation raises a clear ValueError when the resolved temperature_high equals temperature_low, instead of silently sending 0 heating/cooling steps to LAMMPS (which failed deep inside with an unrelated "Invalid dump frequency 0" error).
Fixes the "sample explodes" report for SHIK viscosity runs.
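The early temperature check from the list above can be sketched as follows. This is a hedged illustration: `ramp_steps`, its parameters, and the units (K/ps for the rate, fs for the timestep) are assumptions, not the signature of melt_quench_simulation.

```python
def ramp_steps(temperature_high, temperature_low, rate_k_per_ps, timestep_fs=1.0):
    """Number of MD steps needed to ramp between two temperatures."""
    if temperature_high == temperature_low:
        # Fail early with a clear message instead of handing a
        # zero-step ramp to LAMMPS.
        raise ValueError(
            f"temperature_high ({temperature_high} K) equals temperature_low "
            f"({temperature_low} K); the heating/cooling ramp would have 0 steps."
        )
    ramp_time_ps = abs(temperature_high - temperature_low) / rate_k_per_ps
    # 1 ps = 1000 fs
    return round(ramp_time_ps * 1000.0 / timestep_fs)
```

For example, under these assumed units, cooling from 4000 K to 300 K at 1 K/ps with a 1 fs timestep takes 3700 ps of simulated time, so `ramp_steps(4000.0, 300.0, 1.0)` evaluates to 3,700,000 steps.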
Can you check now whether it works on your sample? I tried many glasses with Li2O, mainly Li2O:20, SiO2:80 (the one you suggested) and Li2O:15, Na2O:18, SiO2:67. Both the viscosity and elastic moduli workflows run locally for me.
Thanks a lot @Atilaac, will do!
Check the commit message; I put a description there. Some of the other changes only reduce code duplication and are not directly related to the problem.
By the way, I noticed you included a pixi.lock update in your commit. I will revert it since it led to problems on my side. Feel free to open a separate PR for this.
Caused problems locally
On a partial rerun, executorlib reads a cached parent step's SLURM queue_id back out of its _o.h5 and injects it as an 'afterok' dependency for the resubmitted child. That id is long purged from the scheduler, so sbatch rejects the submission with 'Job dependency problem' (or emits afterok:None,None,None when no live id exists). _clear_executor_cache(failed_only=True) now strips the queue_id from every successful cache file it keeps, so executorlib omits the dependency entirely (the parent result is already on disk). Adds test_rerun_cache.py.
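The effect of stripping the stored queue_id can be illustrated with the sketch below. It does not use executorlib's API or HDF5 files: cache records are modelled as plain dicts, and `strip_queue_ids` and `dependency_args` are hypothetical helpers. The `--dependency=afterok:<id>[:<id>...]` form is standard sbatch syntax.

```python
def strip_queue_ids(cache_records):
    """Return copies of the kept (successful) records without 'queue_id'."""
    return {
        name: {key: value for key, value in record.items() if key != "queue_id"}
        for name, record in cache_records.items()
    }


def dependency_args(parent_records):
    """Build sbatch dependency arguments from the parents' records."""
    ids = [
        str(record["queue_id"])
        for record in parent_records
        if record.get("queue_id") is not None
    ]
    if not ids:
        # No live job id: omit the flag entirely rather than emitting
        # something like 'afterok:None'.
        return []
    return ["--dependency=afterok:" + ":".join(ids)]
```

Once the id is removed from a kept record, the resubmitted child carries no dependency on a job the scheduler has already purged, and it reads the parent result from disk instead.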
Run 3 different temperatures in parallel.