Feature idea: keyword "pickup" in `run!` #1068

glwagner · 2020-10-14T18:48:14Z

Currently restoring a model from a checkpoint uses the function restore_from_checkpoint. However, this design can be difficult to use because, in general, restoring a model requires passing the forcing functions and boundary condition functions explicitly to restore_from_checkpoint. This means that part of the setup script needs to be replicated in the script that restores a model from the checkpoint.

A different solution may require less work from users who want to restore a model from a checkpoint: extend run! with the keyword argument pickup.

If pickup=true (and a checkpointer has been added to simulation.output_writers), then prior to executing the run! loop:

1. The checkpointer file corresponding to the most recent model iteration will be identified;
2. Checkpointer data will be synced with model.velocities, model.tracers, model.clock, and, for QuasiAdamsBashforth2 timesteppers, the tendency model.timesteppers.Gⁿ.
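As a rough sketch of the first step, here is one way the most recent checkpoint could be found. The helper names and the filename pattern `<prefix>_iteration<N>.jld2` are assumptions made for illustration, not necessarily what the Checkpointer does.

```julia
# Hypothetical helper: extract the iteration number from a checkpoint filename.
# Assumes names like "checkpoint_iteration1000.jld2" (an assumption, see above).
function checkpoint_iteration(filename::AbstractString, prefix::AbstractString)
    head = prefix * "_iteration"
    tail = ".jld2"
    startswith(filename, head) && endswith(filename, tail) || return nothing
    digits = filename[ncodeunits(head)+1:end-ncodeunits(tail)]
    return tryparse(Int, digits)  # `nothing` if the middle part is not an integer
end

# Hypothetical helper: return the path of the checkpoint with the largest
# iteration number in `dir`, or `nothing` if there are no checkpoints.
function latest_checkpoint(dir::AbstractString, prefix::AbstractString)
    best_file, best_iteration = nothing, -1
    for filename in readdir(dir)
        iteration = checkpoint_iteration(filename, prefix)
        if iteration !== nothing && iteration > best_iteration
            best_file, best_iteration = joinpath(dir, filename), iteration
        end
    end
    return best_file
end
```

Comparing parsed integers rather than filenames matters here: sorted as strings, "iteration200" would come after "iteration1000".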

For this to work seamlessly when a JLD2OutputWriter is involved we might need to modify the constructor for JLD2OutputWriter. Currently the constructor runs an arbitrary init(file, model) function, and saves certain properties identified in the constructor signature:

Oceananigans.jl/src/OutputWriters/jld2_output_writer.jl, lines 157 to 160 in 9cfb568:

    jldopen(filepath, "a+"; jld2_kw...) do file
        init(file, model)
        saveproperties!(file, model, including)
    end

This will fail if the output file already exists; thus it's not possible to pick up from a checkpoint with the same script used to initialize the model (without destroying prior output). A simple change could just be to allow init and saveproperties! to fail with try/catch (or to avoid running those functions if a file already exists). There are also some shenanigans that'd have to be done for output files split into multiple parts. (We could, potentially, require pickup=true in the JLD2OutputWriter constructor, rather than changing its default behavior to accommodate auto-checkpoint-pickup.)
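The "avoid running those functions if a file already exists" option could look roughly like the sketch below. `initialize_output!` is a hypothetical name, and a plain text file stands in for the JLD2 file so that the example only needs Base.

```julia
# Hypothetical sketch: run one-time initialization only when the output file is new.
# `init` is any function that accepts the opened file stream.
function initialize_output!(init, filepath::AbstractString)
    isfile(filepath) && return false  # existing output: leave it untouched
    open(init, filepath, "w")         # new output: create the file, run init once
    return true
end
```

With a guard like this, running the same setup script a second time would skip initialization instead of failing or clobbering earlier output.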

Another caveat is that this method of restoring checkpointed data will not work for large CPU models that consume almost all of the CPU memory (such that a single field cannot be loaded from file after model has been instantiated). These cases are relatively rare right now, since such large models would typically run very slowly on a typical single node.

The basic idea is:

    # Model and simulation setup

    simulation.output_writers[:checkpointer] = Checkpointer(...)

    run!(simulation, pickup=true)

We could also allow pickup to be an iteration number, e.g.

    # Model and simulation setup

    simulation.output_writers[:checkpointer] = Checkpointer(...)

    run!(simulation, pickup=10138)

It may also be possible to enable this functionality with an environment variable, e.g.

    PICKUP=true julia --project run_cool_simulation.jl
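A script could read such a variable along these lines. `pickup_from_env` is a hypothetical helper, and accepting an integer iteration as well as true/false is an assumption that mirrors the keyword options above.

```julia
# Hypothetical helper: interpret the PICKUP environment variable as
# `false` (the default), `true`, or an integer iteration number.
function pickup_from_env(env=ENV)
    value = lowercase(strip(get(env, "PICKUP", "false")))
    value in ("true", "false") && return value == "true"
    iteration = tryparse(Int, value)
    iteration === nothing &&
        error("PICKUP must be true, false, or an integer; got \"$value\"")
    return iteration
end
```

The script would then call something like `run!(simulation, pickup=pickup_from_env())`, assuming run! gains the pickup keyword proposed here.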

Note that this design works even if model.clock.iteration==0, since the initial checkpoint can be picked up.


glwagner · 2020-10-14T18:48:30Z

@christophernhill feedback appreciated!

glwagner · 2020-10-16T22:59:08Z

Related issue: #779

ali-ramadhan · 2020-10-17T16:04:09Z

A consideration when picking up from a checkpoint and using NetCDFOutputWriter is that mode="a" (append) needs to be used instead of mode="c" (create or clobber) when creating the NetCDFOutputWriter. This functionality works and is tested, but currently needs to be set manually by the user.

Not sure of the best way of making this easy for users without accidentally overwriting their data.

I can think of three solutions:

1. Not specifying a mode causes mode="c" if the file does not exist and mode="a" if the file does exist. I like this solution the most as it works well with and without a checkpointer (and users don't have to do anything to get reasonable default behavior).
2. Add a force kwarg to NetCDFOutputWriter that is false by default. The NetCDFOutputWriter will error if you try to overwrite an existing file, allowing the user to go back and set mode="a" without any data loss. A pickup kwarg could perform a similar function if it's false by default.
3. Setting the PICKUP environment variable causes mode="a" to be the default if the file already exists. But I think we should avoid using global environment variables to modify internal behavior.
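The first solution amounts to a small default rule, sketched here. `resolve_mode` is a hypothetical helper and not part of NetCDFOutputWriter; an explicitly passed mode is returned unchanged.

```julia
# Hypothetical helper: choose the NetCDF file mode when the user gives none.
# "a" appends to an existing file; "c" creates a new one (or clobbers).
function resolve_mode(filepath::AbstractString; mode=nothing)
    mode === nothing || return mode        # an explicit choice always wins
    return isfile(filepath) ? "a" : "c"    # default: never clobber existing data
end
```

Under this rule the only way to overwrite an existing file is to ask for mode="c" explicitly.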

ali-ramadhan · 2020-10-17T16:08:06Z

> Another caveat is that this method of restoring checkpointed data will not work for large CPU models that consume almost all of the CPU memory (such that a single field cannot be loaded from file after model has been instantiated). These cases are relatively rare right now, since such large models would typically run very slowly on a typical single node.

Not sure how it would work with JLD2 but Base.read! can fill an array by reading data from disk: https://docs.julialang.org/en/v1/base/io-network/#Base.read!
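For raw binary data (not JLD2, which is the open question here), filling already-allocated memory with read! looks like this minimal sketch. The array shape and file name are made up for the example.

```julia
# Stand-in for model field memory that has already been allocated.
field = zeros(Float64, 4, 3)

# Write some raw binary data to disk (this is NOT the JLD2 format).
saved = reshape(collect(1.0:12.0), 4, 3)
filepath = joinpath(mktempdir(), "field.bin")
write(filepath, saved)

# Fill the existing array in place, so no second array of the same size is created.
read!(filepath, field)
```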

ali-ramadhan · 2020-10-17T16:16:18Z

I like this idea a lot! It would make checkpointing and picking up from a checkpoint much easier, especially for users.

How would this work if you have one spin up script and another script for the interesting part of the simulation that picks up from the checkpoint made at the end of spin up?

I guess there are two options for pickup: true would pick up from the latest checkpoint, while an integer 12345 would pick up from the checkpoint made at iteration 12345.

If we add a third option: pickup=some_filepath::String then run! can pick up from the checkpoint file located at some_filepath. This would enable scripts to pick up from checkpoints produced by any other script (provided the grid is the same etc.).
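The three options map naturally onto multiple dispatch. A minimal sketch, where `resolve_pickup` and the `available` dictionary (iteration number => checkpoint filepath) are hypothetical names used only for illustration:

```julia
# pickup=true: use the checkpoint with the largest iteration; pickup=false: none.
resolve_pickup(pickup::Bool, available::Dict{Int,String}) =
    pickup && !isempty(available) ? available[maximum(keys(available))] : nothing

# pickup=12345: use the checkpoint saved at that iteration.
resolve_pickup(pickup::Integer, available::Dict{Int,String}) = available[pickup]

# pickup="path/to/file": use the given file, possibly written by another script.
resolve_pickup(pickup::AbstractString, available::Dict{Int,String}) = String(pickup)
```

Note that Bool is a subtype of Integer in Julia, so the Bool method has to exist separately; otherwise pickup=true would be treated as iteration 1.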

X-Ref: #602

glwagner · 2020-10-17T16:40:50Z

> How would this work if you have one spin up script and another script for the interesting part of the simulation that picks up from the checkpoint made at the end of spin up?

This is an interesting and, I think, common use case that's worth thinking about (cc @sandreza). The current approach, which I think works well, is to build a new model with restore_from_checkpoint while reassigning some properties of the model (as well as including the ones that couldn't be checkpointed).

(Thinking about it more, we might almost rename restore_from_checkpoint to IncompressibleModel, since it sort of is an alternative constructor for IncompressibleModel...)

If we deprecate restore_from_checkpoint, then users would be required to rebuild IncompressibleModel from scratch and use a function

    set!(model, checkpoint_filepath::String)

which would be implemented as part of the pickup feature we're discussing. I think this method has pros and cons over the "constructor" method:

Cons

1. It generates boilerplate in user scripts, since the second script will inevitably reproduce much of the original "spinup" run script.
2. It wastes memory.

Pro

1. The setup script for the second simulation is easier to interpret, since it's more "stand-alone", containing all the information necessary to understand basic aspects of the model configuration, like the grid, coriolis, buoyancy, etc. In other words, using restore_from_checkpoint ties two scripts together, since one cannot understand the script containing restore_from_checkpoint without looking at the script that produced the checkpoint in the first place.

I think we can mitigate Con.1 by designing helper functions that restore individual model properties. Something like

    grid = restore(checkpoint_filepath, :grid)
    coriolis = restore(checkpoint_filepath, :coriolis)

(This kind of feature is important for post processing with the other output writers too.)

I guess we can't solve Con.1 without diluting Pro.1.

glwagner added the feature 🌟 and output 💾 labels Oct 14, 2020

This was referenced Oct 19, 2020

More user-friendly JLD2OutputWriter #963

Closed

New checkpointer features: set! and simulation "pickup" #1082

Merged

glwagner closed this as completed in #1082 Oct 26, 2020
