Feature idea: keyword "pickup" in run! #1068

Closed
glwagner opened this issue Oct 14, 2020 · 6 comments · Fixed by #1082
Labels
feature 🌟 Something new and shiny output 💾

Comments

@glwagner
Member

glwagner commented Oct 14, 2020

Currently restoring a model from a checkpoint uses the function restore_from_checkpoint. However, this design can be difficult to use because, in general, restoring a model requires passing the forcing functions and boundary condition functions explicitly to restore_from_checkpoint. This means that part of the setup script needs to be replicated in the script that restores a model from the checkpoint.

A different solution may require less work from users who want to restore a model from a checkpoint: extend run! with the keyword argument pickup.

If pickup=true (and a checkpointer has been added to simulation.output_writers), then prior to executing the run! loop,

  1. The checkpointer file corresponding to the most recent model iteration will be identified;
  2. Checkpointer data will be synced with model.velocities, model.tracers, model.clock, and, for QuasiAdamsBashforth2 timesteppers, the tendency model.timesteppers.Gⁿ.
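Step 1 could be sketched roughly as below. This is only an illustration, not the proposed implementation: the helper name latest_checkpoint is hypothetical, and it assumes checkpoint files are named "<prefix>_iteration<N>.jld2" with a prefix that contains no regex special characters.

```julia
# Hypothetical helper: find the checkpoint file with the largest iteration.
# Assumes files are named "<prefix>_iteration<N>.jld2"; the real naming
# scheme used by the checkpointer may differ.
function latest_checkpoint(dir, prefix)
    pattern = Regex("^" * prefix * "_iteration(\\d+)\\.jld2\$")
    best_iteration, best_file = -1, nothing
    for name in readdir(dir)
        m = match(pattern, name)
        m === nothing && continue              # not a checkpoint file
        iteration = parse(Int, m.captures[1])
        if iteration > best_iteration
            best_iteration, best_file = iteration, joinpath(dir, name)
        end
    end
    return best_file                           # `nothing` if none was found
end
```

Returning nothing when no file matches lets the caller decide whether a missing checkpoint is an error or simply means "start from scratch".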

For this to work seamlessly when a JLD2OutputWriter is involved we might need to modify the constructor for JLD2OutputWriter. Currently the constructor runs an arbitrary init(file, model) function, and saves certain properties identified in the constructor signature:

jldopen(filepath, "a+"; jld2_kw...) do file
    init(file, model)
    saveproperties!(file, model, including)
end

This will fail if the output file already exists; thus it's not possible to pick up from a checkpoint with the same script used to initialize the model (without destroying prior output). A simple change could be to allow init and saveproperties! to fail with try/catch (or to skip those functions if the file already exists). There are also some shenanigans that would have to be done for output files split into multiple parts. (We could, potentially, require pickup=true in the JLD2OutputWriter constructor, rather than changing its default behavior to accommodate automatic checkpoint pickup.)
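The "skip if the file already exists" variant might look something like the following sketch. The function maybe_initialize and its pickup keyword are hypothetical, and plain open stands in for jldopen so the example runs without JLD2.

```julia
# Hypothetical sketch: `initialize!` stands in for the init / saveproperties!
# calls, and Base's `open` stands in for `jldopen`.
function maybe_initialize(initialize!, filepath; pickup=false)
    if pickup && isfile(filepath)
        return false                  # existing output is left untouched
    end
    open(initialize!, filepath, "w")  # create (or overwrite) and initialize
    return true
end

# Usage with do-block syntax; the second call is a no-op.
path = joinpath(mktempdir(), "output.txt")
maybe_initialize(path; pickup=true) do file
    write(file, "header")
end
maybe_initialize(path; pickup=true) do file
    write(file, "this is never written")
end
```

Checking isfile up front, rather than wrapping the calls in try/catch, avoids hiding unrelated errors raised inside the initialization function.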

Another caveat is that this method of restoring checkpointed data will not work for large CPU models that consume almost all of the CPU memory (such that a single field cannot be loaded from file after model has been instantiated). These cases are relatively rare right now, since such large models would typically run very slowly on a typical single node.

The basic idea is:

# Model and simulation setup

simulation.output_writers[:checkpointer] = Checkpointer(...)

run!(simulation, pickup=true)

We could also allow pickup to be an iteration number, e.g.

# Model and simulation setup

simulation.output_writers[:checkpointer] = Checkpointer(...)

run!(simulation, pickup=10138)

It may also be possible to enable this functionality with an environment variable, e.g.

PICKUP=true julia --project run_cool_simulation.jl
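Inside the script, the flag could be read with something like the sketch below. The function name pickup_from_env is hypothetical, and it assumes the variable holds the literal string "true" or "false".

```julia
# Read the hypothetical PICKUP environment variable.
# An unset variable means "do not pick up".
pickup_from_env() = parse(Bool, get(ENV, "PICKUP", "false"))

# The result would then be forwarded to the proposed keyword argument:
# run!(simulation, pickup=pickup_from_env())
```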

Note that this design works even if model.clock.iteration==0, since the initial checkpoint can be picked up.

@glwagner
Member Author

@christophernhill feedback appreciated!

@glwagner glwagner added feature 🌟 Something new and shiny output 💾 labels Oct 14, 2020
@glwagner
Member Author

Related issue: #779

@ali-ramadhan
Member

A consideration when picking up from a checkpoint and using NetCDFOutputWriter is that mode="a" (append) needs to be used instead of mode="c" (create or clobber) when creating the NetCDFOutputWriter. This functionality works and is tested, but currently needs to be set manually by the user.

Not sure of the best way of making this easy for users without accidentally overwriting their data.

I can think of three solutions:

  1. Not specifying a mode causes mode="c" if the file does not exist and mode="a" if the file does exist. I like this solution the most as it works well with and without a checkpointer (and users don't have to do anything to get reasonable default behavior).
  2. Add a force kwarg to NetCDFOutputWriter that is false by default. The NetCDFOutputWriter will error if you try to overwrite an existing file, allowing the user to go back and set mode="a" without any data loss. A pickup kwarg could perform a similar function if it's false by default.
  3. Setting the PICKUP environment variable causes mode="a" to be the default if the file already exists. But I think we should avoid using global environment variables to modify internal behavior.
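Solution 1 amounts to a one-line default. A minimal sketch, where default_mode is a hypothetical helper and not part of NetCDFOutputWriter:

```julia
# Hypothetical helper: append if the file exists, otherwise create it.
default_mode(filepath) = isfile(filepath) ? "a" : "c"

path = tempname()      # a path that does not exist yet
default_mode(path)     # "c": nothing to clobber, so create the file
touch(path)
default_mode(path)     # "a": the file now exists, so append to it
```

A user who really does want to overwrite existing output would still pass mode="c" explicitly.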

@ali-ramadhan
Member

ali-ramadhan commented Oct 17, 2020

> Another caveat is that this method of restoring checkpointed data will not work for large CPU models that consume almost all of the CPU memory (such that a single field cannot be loaded from file after model has been instantiated). These cases are relatively rare right now, since such large models would typically run very slowly on a typical single node.

Not sure how it would work with JLD2 but Base.read! can fill an array by reading data from disk: https://docs.julialang.org/en/v1/base/io-network/#Base.read!
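For a raw binary file (not JLD2), filling a preallocated array in place looks roughly like this. The array sizes and the file are made up for illustration.

```julia
# Write some data as a raw binary dump (column-major, no header).
path = tempname()
data = rand(Float64, 4, 3)
write(path, data)

# Later: fill an already-allocated array in place, without allocating
# a second array of the same size for the data read from disk.
field = zeros(Float64, 4, 3)
open(path, "r") do io
    read!(io, field)
end

field == data  # true
```

Whether JLD2 exposes an equivalent in-place read is the open question; this only shows the Base behavior.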

@ali-ramadhan
Member

I like this idea a lot! It would make checkpointing and picking up from a checkpoint much easier, especially for users.

How would this work if you have one spin up script and another script for the interesting part of the simulation that picks up from the checkpoint made at the end of spin up?

I guess there are two options for pickup: true would pick up from the latest checkpoint, while an integer 12345 would pick up from the checkpoint made at iteration 12345.

If we add a third option: pickup=some_filepath::String then run! can pick up from the checkpoint file located at some_filepath. This would enable scripts to pick up from checkpoints produced by any other script (provided the grid is the same etc.).
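The three options map naturally onto multiple dispatch. A rough sketch, in which resolve_checkpoint and the available dictionary (iteration => filepath) are hypothetical:

```julia
# Hypothetical sketch: resolve the `pickup` argument to a checkpoint filepath.
# `available` maps iteration numbers to the filepaths of existing checkpoints.

# Bool is a subtype of Integer in Julia, but dispatch picks the most
# specific method, so `true` and `false` land here rather than below.
resolve_checkpoint(pickup::Bool, available) =
    (pickup && !isempty(available)) ? available[maximum(keys(available))] : nothing

# An integer selects the checkpoint saved at that iteration.
resolve_checkpoint(pickup::Integer, available) = available[pickup]

# A string is taken to be the filepath itself.
resolve_checkpoint(pickup::AbstractString, available) = pickup

available = Dict(0 => "checkpoint_0.jld2", 10138 => "checkpoint_10138.jld2")
resolve_checkpoint(true, available)           # "checkpoint_10138.jld2"
resolve_checkpoint(0, available)              # "checkpoint_0.jld2"
resolve_checkpoint("spinup.jld2", available)  # "spinup.jld2"
```

With this layout, run! would only need to call one function and act on the returned path (or do nothing when it returns nothing).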

X-Ref: #602

@glwagner
Member Author

> How would this work if you have one spin up script and another script for the interesting part of the simulation that picks up from the checkpoint made at the end of spin up?

This is an interesting and, I think, common use case that's worth thinking about (cc @sandreza). The current approach, which I think works well, is to build a new model with restore_from_checkpoint while reassigning some properties of the model (and supplying the ones that couldn't be checkpointed).

(Thinking about it more, we might almost rename restore_from_checkpoint to IncompressibleModel, since it sort of is an alternative constructor for IncompressibleModel...)

If we deprecate restore_from_checkpoint, then users would be required to rebuild IncompressibleModel from scratch and use a function

set!(model, checkpoint_filepath::String)

which would be implemented as part of the pickup feature we're discussing. I think this method has pros and cons over the "constructor" method:

Cons

  1. It generates boilerplate in user scripts, since the second script will inevitably reproduce much of the original "spinup" run script.

  2. It wastes memory.

Pro

  1. The setup script for the second simulation is easier to interpret, since it's more "stand-alone", containing all the information necessary to understand basic aspects of the model configuration, like the grid, coriolis, buoyancy, etc. In other words, using restore_from_checkpoint ties two scripts together, since one cannot understand the script containing restore_from_checkpoint without looking at the script that produced the checkpoint in the first place.

I think we can mitigate Con.1 by designing helper functions that restore individual model properties. Something like

grid = restore(checkpoint_filepath, :grid)
coriolis = restore(checkpoint_filepath, :coriolis)

(This kind of feature is important for post processing with the other output writers too.)

I guess we can't solve Con.1 without diluting Pro.1.
