When setting up scripts to run on HPC clusters with strict job time limits, you need to write scripts that checkpoint periodically, restore from the latest checkpoint, and resubmit the job script once the wall-time limit has been reached.
This introduces some complexity into the script, e.g. an if model.clock.iteration != 0 statement around the code that sets initial conditions, and potentially extra data wrangling if an output writer needs to be configured to append to an existing file. It's quite easy to forget something important and introduce bugs into these scripts. That is undesirable because these jobs tend to run for a long time and we frequently submit many of them, so the cost of a mistake can be quite high.
In the future, I think developing some kind of abstraction for setting up this kind of script will be important for people who run on HPC clusters with time limits. I'm not sure what it should look like, but ideally it would minimize the chance of misconfiguration: users could take a regular script without checkpointing and have it restore from a checkpoint without having to worry about correctly setting initial conditions, configuring output writers, etc.
Is it possible to implement an alternative to run!(simulation) for this purpose?
One observation is that after a model is set up, the simulation can be advanced to a specified time and iteration by setting the "state" of the model (velocity fields, tracer fields, and tendency fields), and resetting the model clock time and iteration.
Therefore, we could design an abstraction that runs a simulation from the latest checkpoint found in a particular directory (or something like that). If no checkpoint is found, run!(simulation) is called as usual. If a checkpoint is found, the model state, time, and iteration are updated using checkpointed data prior to calling run!(simulation).
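To make the idea concrete, here is a minimal sketch of the "find the latest checkpoint, restore if present, then run" logic in plain Julia. Everything in it is an assumption for illustration: the names latest_checkpoint and run_from_latest! are hypothetical, the file naming pattern "<prefix>_iteration<N>.jld2" is assumed and may need adjusting, and the restore! and advance! callbacks stand in for whatever code actually sets the model state and calls run!(simulation).

```julia
# Hypothetical helper: return the path of the checkpoint with the highest
# iteration number in `dir`, or `nothing` if no checkpoint is found.
# Assumes files are named "<prefix>_iteration<N>.jld2".
function latest_checkpoint(dir; prefix="checkpoint")
    pattern = Regex("^" * prefix * "_iteration(\\d+)\\.jld2\$")
    best_file, best_iter = nothing, -1
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue          # skip files that are not checkpoints
        iter = parse(Int, m.captures[1])
        if iter > best_iter
            best_file, best_iter = joinpath(dir, name), iter
        end
    end
    return best_file
end

# Hypothetical wrapper: `restore!(simulation, file)` should update the model
# state, time, and iteration from `file`; `advance!(simulation)` should run
# the simulation (e.g. by calling run!(simulation)).
function run_from_latest!(simulation, dir; restore!, advance!, prefix="checkpoint")
    file = latest_checkpoint(dir; prefix=prefix)
    file === nothing || restore!(simulation, file)   # fresh start if no checkpoint
    advance!(simulation)
    return file
end
```

With something like this, the user script would stay the same whether or not a checkpoint exists; only the restore! callback needs to know how the checkpoint data maps onto the model.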
and the run script only resubmits if that file is not found
I'm not sure this would work, but could we also use environment variables for this? There could be an environment variable called SIMULATION_COMPLETE or something like that.
I don't think I completely understand how the resubmission works, so an explanation of that might be helpful. I guess there is a slurm script with a few lines at the end?
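For what it's worth, one way the marker-file idea could look is sketched below. This is only a guess at a possible setup, not a description of how anyone's scripts currently work: the marker name SIMULATION_COMPLETE is borrowed from the suggestion above, and mark_complete, is_complete, and resubmit_command are hypothetical helpers. The sketch uses a file because it persists on disk between jobs; sbatch is the standard Slurm submission command.

```julia
# Hypothetical marker file, written once the simulation reaches its stop time.
const COMPLETION_MARKER = "SIMULATION_COMPLETE"

# Create the marker file in `dir` (called at the end of a finished run).
mark_complete(dir) = touch(joinpath(dir, COMPLETION_MARKER))

# Check whether a previous job already finished the simulation.
is_complete(dir) = isfile(joinpath(dir, COMPLETION_MARKER))

# Build (but do not run) the resubmission command.
# Returns `nothing` when the marker exists, so nothing is resubmitted.
function resubmit_command(dir, job_script)
    is_complete(dir) && return nothing
    return `sbatch $job_script`
end
```

The last lines of the job would then be something like cmd = resubmit_command(dir, "job.sh") followed by cmd === nothing || run(cmd), so the chain of jobs stops as soon as the marker appears.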