Making it easier to set up and configure scripts with checkpointing on clusters with time limits #779

ali-ramadhan · 2020-06-23T13:53:50Z

When setting up scripts to run on HPC clusters with strict job time limits, you need to write scripts that checkpoint regularly, restore from the latest checkpoint, and resubmit the job script once the wall-time limit has been reached.

This introduces some complexity into the script, e.g. an if model.clock.iteration != 0 statement for setting initial conditions, and potentially extra data wrangling if an output writer needs to be configured to append to an existing file. It's quite easy to forget something important and introduce bugs into these scripts. That is undesirable because these jobs tend to run for a long time and we frequently submit many of them, so the cost of a mistake can be quite high.
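The restart logic described above can be sketched as follows. This is a minimal, self-contained illustration rather than Oceananigans code: ToyClock, ToyModel, set_initial_conditions!, and output_mode are hypothetical names invented for the example, and the fresh-start test is written as iteration == 0, which is the complement of the != 0 check mentioned above.

```julia
# Hypothetical stand-ins for a model and its clock (not the Oceananigans types).
mutable struct ToyClock
    time::Float64
    iteration::Int
end

mutable struct ToyModel
    clock::ToyClock
    temperature::Vector{Float64}
end

# Set initial conditions only on a fresh start. After a restore the clock
# iteration is nonzero and the restored state must not be overwritten.
function set_initial_conditions!(model::ToyModel)
    if model.clock.iteration == 0
        model.temperature .= 20.0
    end
    return model
end

# Choose how an output file is opened: append when continuing a run,
# otherwise start a new file.
output_mode(model::ToyModel) = model.clock.iteration == 0 ? "w" : "a"

fresh = set_initial_conditions!(ToyModel(ToyClock(0.0, 0), zeros(4)))
restored = set_initial_conditions!(ToyModel(ToyClock(3600.0, 100), fill(18.5, 4)))
```

Forgetting either branch is the kind of mistake the issue is about: without the iteration check a restored run would silently restart from the initial state, and without the append mode it would overwrite earlier output.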

In the future I think developing some kind of abstraction for setting up this kind of script will be important for people who run on HPC clusters with time limits. I'm not sure what it should look like, but ideally it would minimize the chance of misconfiguration: users could take a regular script without checkpointing and have it restore from a checkpoint without having to worry about correctly setting initial conditions, configuring output writers, etc.


ali-ramadhan · 2020-06-25T14:00:36Z

It would also be good if such a feature/framework/abstraction had some easy way of telling Slurm or LSF not to resubmit the script.

A hacky approach would be something like

if model.clock.time >= end_time
    touch("done.tmp")  # create an empty marker file
end

and the run script only resubmits if that file is not found.
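A slightly fuller sketch of the marker-file idea, assuming the marker lives in the run directory. The function names and the job.sh script name are hypothetical; sbatch is the Slurm submission command, and the command is only built here, never executed.

```julia
# Marker file name from the snippet above; the directory layout is an assumption.
const DONE_FILE = "done.tmp"

# Called at the end of the Julia script: create the marker once the
# simulation has reached its end time.
function mark_done_if_finished(current_time, end_time; dir=".")
    if current_time >= end_time
        touch(joinpath(dir, DONE_FILE))
        return true
    end
    return false
end

# Called by whatever resubmits the job: resubmit only if no marker exists.
should_resubmit(; dir=".") = !isfile(joinpath(dir, DONE_FILE))

# Build (but do not run) the resubmission command.
# `job.sh` is a hypothetical script name.
resubmit_command(script="job.sh") = `sbatch $script`
```

A stale done.tmp left over from an earlier run would suppress resubmission, so a real workflow would need to delete it before the first submission.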

glwagner · 2020-06-29T18:30:15Z

Is it possible to implement an alternative to run!(simulation) for this purpose?

One observation is that after a model is set up, the simulation can be advanced to a specified time and iteration by setting the "state" of the model (velocity fields, tracer fields, and tendency fields), and resetting the model clock time and iteration.

Therefore, we could design an abstraction that runs a simulation from the latest checkpoint found in a particular directory (or something like that). If no checkpoint is found, run!(simulation) is called as usual. If a checkpoint is found, the model state, time, and iteration are updated using checkpointed data prior to calling run!(simulation).

Perhaps something like

continue!(simulation; checkpoint_dir=".")
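As a rough sketch of how such a function might locate the latest checkpoint and decide whether to restore: the file naming convention (prefix_iterationN.jld2), the names latest_checkpoint and continue_run!, and the restore!/run! arguments are all assumptions made for illustration, not an existing API.

```julia
# Find the checkpoint with the highest iteration number in `dir`.
# Assumed naming convention: "<prefix>_iteration<N>.jld2".
function latest_checkpoint(dir="."; prefix="checkpoint")
    pattern = Regex("^" * prefix * "_iteration(\\d+)\\.jld2\$")
    best_path, best_iter = nothing, -1
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue
        iter = parse(Int, m.captures[1])  # compare numerically, not as strings
        if iter > best_iter
            best_path, best_iter = joinpath(dir, name), iter
        end
    end
    return best_path  # `nothing` if no checkpoint was found
end

# Hypothetical driver: `restore!` and `run!` are passed in as functions so
# the sketch stays independent of any particular package.
function continue_run!(simulation, restore!, run!; checkpoint_dir=".")
    path = latest_checkpoint(checkpoint_dir)
    path === nothing || restore!(simulation, path)
    return run!(simulation)
end
```

Parsing the iteration as an integer matters: a plain string sort would rank iteration 200 after iteration 1000.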

glwagner · 2020-06-29T18:36:23Z

> and the run script only resubmits if that file is not found

I'm not sure this would work, but could we also use environment variables for this? There could be an environment variable called SIMULATION_COMPLETE or something like that.
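A minimal sketch of reading and setting such a flag from Julia, using the variable name suggested above. The helper names are hypothetical. One caveat that applies to any process: a variable set inside the Julia process is visible to that process and its children only, so the shell or scheduler script that launched Julia would not see it.

```julia
# Variable name taken from the comment above.
const COMPLETE_VAR = "SIMULATION_COMPLETE"

# Read the flag from the environment this process inherited.
simulation_complete() = lowercase(get(ENV, COMPLETE_VAR, "false")) == "true"

# Setting the variable here only affects this process and its children;
# the shell that launched Julia does not see the change.
function mark_complete!()
    ENV[COMPLETE_VAR] = "true"
    return nothing
end
```

Because of that limitation, a marker file or the process exit code is the more usual way to pass completion status back to the submitting script.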

I don't think I completely understand how the resubmission works, so an explanation of that might be helpful. I guess there is a Slurm script with a few lines at the end that resubmit it?

ali-ramadhan · 2021-03-17T02:49:59Z

Pretty sure this was resolved with PR #1082.

ali-ramadhan added the abstractions 🎨 and output 💾 labels Jun 23, 2020

glwagner mentioned this issue Oct 16, 2020

Feature idea: keyword "pickup" in run! #1068

Closed

ali-ramadhan mentioned this issue Nov 2, 2020

Example/tutorial on checkpointing #1136

Closed

ali-ramadhan closed this as completed Mar 17, 2021
