Making it easier to set up and configure scripts with checkpointing on clusters with time limits #779

Closed
ali-ramadhan opened this issue Jun 23, 2020 · 4 comments
@ali-ramadhan
Member

When setting up scripts to run on HPC clusters with strict job time limits, you need to write scripts that checkpoint continuously, restore from the latest checkpoint, and resubmit the job script once the wall time limit has been reached.

This introduces some complexity into the script, e.g. an if model.clock.iteration != 0 statement for setting initial conditions, and potentially extra data wrangling if an output writer needs to be configured to append to an existing file. It's quite easy to forget something important and introduce bugs into these scripts. That is costly because these jobs tend to run for a long time and we frequently submit many of them.

In the future I think developing some kind of abstraction for setting up this kind of script will be important for people who run scripts on HPC clusters with time limits. I'm not sure what form it should take, but ideally it would minimize the chance of misconfiguration: users could take a regular script without checkpointing and have it restore from a checkpoint without having to worry about correctly setting initial conditions, configuring output writers, etc.

@ali-ramadhan
Member Author

It would also be good if such a feature/framework/abstraction had some easy way of telling Slurm or LSF not to resubmit the script.

A hacky approach would be something like

if model.clock.time > end_time
    touch("done.tmp")  # create a marker file so the run script knows not to resubmit
end

and the run script only resubmits if that file is not found
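
A minimal sketch of the resubmission check described above, written in Julia to match the snippet. The names `simulation_finished` and `resubmit_command`, the job script name `run_simulation.sh`, and the use of `sbatch` are assumptions for illustration; none of this is part of Oceananigans. The command is only built here, not executed.

```julia
# Hypothetical helpers: resubmit only while the marker file is absent.

"Return true if the marker file written at the end of the run exists in `dir`."
simulation_finished(dir; marker="done.tmp") = isfile(joinpath(dir, marker))

"""
Return the Slurm command that would resubmit `job_script`, or `nothing`
if the simulation has already finished. The command is built, not run.
"""
function resubmit_command(dir; job_script="run_simulation.sh", marker="done.tmp")
    simulation_finished(dir; marker=marker) && return nothing
    return `sbatch $job_script`
end

# Usage: demonstrate on a temporary directory.
mktempdir() do dir
    @show resubmit_command(dir)        # marker absent: a command is returned
    touch(joinpath(dir, "done.tmp"))   # what the simulation script does when done
    @show resubmit_command(dir)        # marker present: nothing is returned
end
```

On a real cluster the returned command would be passed to `run`, or the same file check would live in the last few lines of the batch script itself.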

@glwagner
Member

Is it possible to implement an alternative to run!(simulation) for this purpose?

One observation is that after a model is set up, the simulation can be advanced to a specified time and iteration by setting the "state" of the model (velocity fields, tracer fields, and tendency fields), and resetting the model clock time and iteration.

Therefore, we could design an abstraction that runs a simulation from the latest checkpoint found in a particular directory (or something like that). If no checkpoint is found, run!(simulation) is called as usual. If a checkpoint is found, the model state, time, and iteration are updated using checkpointed data prior to calling run!(simulation).

Perhaps something like

continue!(simulation; checkpoint_dir=".")
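
The lookup step in this proposal could be sketched roughly as follows. This is an illustration under stated assumptions: checkpoint files are assumed to be named like `checkpoint_iteration1000.jld2`, and `latest_checkpoint` is a hypothetical helper, not an existing Oceananigans function. Restoring the model state and clock from the file is left out.

```julia
"""
Return the path of the checkpoint with the highest iteration number in `dir`,
or `nothing` if no file matches `<prefix>_iteration<N>.jld2` (an assumed naming scheme).
"""
function latest_checkpoint(dir; prefix="checkpoint")
    pattern = Regex("^" * prefix * "_iteration(\\d+)\\.jld2\$")
    best_path, best_iteration = nothing, -1
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue              # skip files that are not checkpoints
        iteration = parse(Int, m.captures[1])  # compare numerically, not as strings
        if iteration > best_iteration
            best_path, best_iteration = joinpath(dir, name), iteration
        end
    end
    return best_path
end

# Usage: create a few empty stand-in files and pick the latest one.
mktempdir() do dir
    for n in (0, 500, 1000)
        touch(joinpath(dir, "checkpoint_iteration$n.jld2"))
    end
    path = latest_checkpoint(dir)
    # A continue!-style wrapper would restore state from `path` before calling
    # run!(simulation), or call run!(simulation) directly when `path === nothing`.
    @show basename(path)
end
```

Parsing the iteration as an integer matters because a plain string sort would rank iteration 500 after iteration 1000.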

@glwagner
Member

> and the run script only resubmits if that file is not found

I'm not sure this would work, but could we also use environment variables for this? There could be an environment variable called SIMULATION_COMPLETE or something like that.

I don't think I completely understand how the resubmission works, so an explanation of that might be helpful. I guess there is a Slurm script with a few lines at the end that handle it?

@ali-ramadhan
Member Author

Pretty sure this was resolved with PR #1082.


2 participants